Scylla: An Intelligent Proxy Pool for Humanities™

An intelligent proxy pool for humanities. Only Python 3.6 is supported. Key features:

  • Automatic proxy IP crawling and validation
  • Easy-to-use JSON API
  • Simple but beautiful web-based user interface (e.g. geographical distribution of proxies)
  • Get started with as little as one command
  • Simple HTTP forward proxy server
  • Scrapy and requests integration with as little as one line of code
  • Headless browser crawling

For those who prefer Chinese, please read the Chinese Documentation (中文文档).

Get started

Installation

Install directly via pip

pip install scylla
scylla --help
scylla # Run the crawler and web server for JSON API

Install from source

git clone https://github.com/imWildCat/scylla.git
cd scylla

pip install -r requirements.txt

npm install # or yarn install
make build-assets

python -m scylla

Usage

The examples below assume Scylla is running locally (localhost) on port 8899.

Note: The first time you use Scylla, you might have to wait 1 to 2 minutes for some proxy IPs to be populated in the database.
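
Because the database starts out empty, a script that depends on Scylla may want to wait until some proxies are available. The following is a minimal sketch and not part of Scylla: wait_for_proxies is a hypothetical helper, and get_count is any callable you supply that returns the current number of proxies.

```python
import time


def wait_for_proxies(get_count, timeout=180.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_count() until it is positive or the timeout expires.

    Hypothetical helper (not part of Scylla). get_count is supplied by
    the caller. Returns the first positive count, or 0 on timeout.
    """
    deadline = clock() + timeout
    while True:
        count = get_count()
        if count > 0:
            return count
        if clock() >= deadline:
            return 0
        sleep(interval)
```

With requests, get_count could be something like lambda: requests.get('http://localhost:8899/api/v1/proxies').json()['count'], assuming the web server is already accepting connections.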

JSON API

Proxy IP List
http://localhost:8899/api/v1/proxies

Optional URL parameters:

Parameter   Default value   Description
page        1               The page number
limit       20              The number of proxies shown on each page
anonymous   any             Whether to show anonymous proxies. Possible values: true (only anonymous proxies); false (only transparent proxies)
https       any             Whether to show HTTPS proxies. Possible values: true (only HTTPS proxies); false (only HTTP proxies)
countries   None            Filter proxies by country. Format example: US, or for multiple countries: US,GB
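
As an illustration of how these parameters combine, here is a small sketch that builds a query URL with the standard library. build_proxies_url is a hypothetical helper written for this example, not something Scylla provides; parameters left as None are simply omitted so the server defaults apply.

```python
from urllib.parse import urlencode

BASE_URL = 'http://localhost:8899/api/v1/proxies'


def build_proxies_url(page=None, limit=None, anonymous=None, https=None,
                      countries=None, base_url=BASE_URL):
    """Build a proxy list URL from the optional parameters (hypothetical helper)."""
    params = {}
    if page is not None:
        params['page'] = page
    if limit is not None:
        params['limit'] = limit
    if anonymous is not None:
        # The API expects the literal strings "true" / "false"
        params['anonymous'] = 'true' if anonymous else 'false'
    if https is not None:
        params['https'] = 'true' if https else 'false'
    if countries:
        # Multiple countries are passed as one comma-separated value
        params['countries'] = ','.join(countries)
    if not params:
        return base_url
    # safe=',' keeps the comma between country codes unescaped
    return base_url + '?' + urlencode(params, safe=',')
```

For example, build_proxies_url(https=True, countries=['US', 'GB']) produces a URL ending in ?https=true&countries=US,GB.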

Sample result:

{
    "proxies": [{
        "id": 599,
        "ip": "91.229.222.163",
        "port": 53281,
        "is_valid": true,
        "created_at": 1527590947,
        "updated_at": 1527593751,
        "latency": 23.0,
        "stability": 0.1,
        "is_anonymous": true,
        "is_https": true,
        "attempts": 1,
        "https_attempts": 0,
        "location": "54.0451,-0.8053",
        "organization": "AS57099 Boundless Networks Limited",
        "region": "England",
        "country": "GB",
        "city": "Malton"
    }, {
        "id": 75,
        "ip": "75.151.213.85",
        "port": 8080,
        "is_valid": true,
        "created_at": 1527590676,
        "updated_at": 1527593702,
        "latency": 268.0,
        "stability": 0.3,
        "is_anonymous": true,
        "is_https": true,
        "attempts": 1,
        "https_attempts": 0,
        "location": "32.3706,-90.1755",
        "organization": "AS7922 Comcast Cable Communications, LLC",
        "region": "Mississippi",
        "country": "US",
        "city": "Jackson"
    },
    ...
    ],
    "count": 1025,
    "per_page": 20,
    "page": 1,
    "total_page": 52
}
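
The page and total_page fields in the sample result make it possible to walk the whole list. Below is a rough sketch of that idea; iter_all_proxies and fetch_page are names made up for this example, and fetch_page is assumed to return a dict shaped like the sample result above.

```python
def iter_all_proxies(fetch_page):
    """Yield every proxy record by requesting pages 1..total_page.

    Hypothetical helper: fetch_page(page) is supplied by the caller and
    must return a dict with "proxies" and "total_page" keys.
    """
    page = 1
    while True:
        payload = fetch_page(page)
        for proxy in payload['proxies']:
            yield proxy
        # Stop once the last page reported by the server has been read
        if page >= payload['total_page']:
            break
        page += 1
```

With requests, fetch_page could be lambda page: requests.get('http://localhost:8899/api/v1/proxies', params={'page': page}).json().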

System Statistics
http://localhost:8899/api/v1/stats

Sample result:

{
    "median": 181.2566407083,
    "valid_count": 1780,
    "total_count": 9528,
    "mean": 174.3290085201
}
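
One simple use of these numbers is to check what share of the crawled proxies is currently valid. The sketch below uses only the valid_count and total_count fields shown in the sample; valid_ratio is a hypothetical helper name.

```python
def valid_ratio(stats):
    """Return valid_count / total_count, or 0.0 when nothing was crawled yet."""
    total = stats['total_count']
    if total == 0:
        return 0.0
    return stats['valid_count'] / total
```

For the sample result above, this gives roughly 0.187, i.e. about 18.7% of the known proxies are valid.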

HTTP Forward Proxy Server

By default, Scylla starts an HTTP forward proxy server on port 8081. Whenever an HTTP request arrives, the server randomly selects a recently updated proxy from the database and forwards the request through it.

Note: HTTPS requests are not supported at present.

An example of using curl with this proxy server:

curl http://api.ipify.org -x http://127.0.0.1:8081

You can also use this feature with requests:

import requests

requests.get('http://api.ipify.org', proxies={'http': 'http://127.0.0.1:8081'})
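
If requests is not available, the same idea can be sketched with the standard library alone. build_forward_proxy_opener is a hypothetical helper for this example; only the http scheme is routed through the proxy, since HTTPS requests are not supported by the forward proxy server.

```python
import urllib.request

FORWARD_PROXY = 'http://127.0.0.1:8081'  # default port mentioned above


def build_forward_proxy_opener(proxy_url=FORWARD_PROXY):
    """Return an opener that sends plain HTTP requests through proxy_url."""
    handler = urllib.request.ProxyHandler({'http': proxy_url})
    return urllib.request.build_opener(handler)
```

A request would then look like build_forward_proxy_opener().open('http://api.ipify.org', timeout=10).read().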

Web UI

Open http://localhost:8899 in your browser to see the Web UI of this project.

Proxy IP List
http://localhost:8899/

Screenshot:

screenshot-proxy-list

Global Geographical Distribution Map
http://localhost:8899/#/geo

Screenshot:

screenshot-geo-distribution

Other Examples

Example with Requests

Requests is a mature and widely used HTTP library for Python. Using Scylla with it is straightforward.

With the JSON API

import requests
import random

json_resp = requests.get('http://localhost:8899/api/v1/proxies').json()
proxy = random.choice(json_resp['proxies'])

requests.get('http://api.ipify.org', proxies={'http': 'http://{}:{}'.format(proxy['ip'], proxy['port'])})

HTTPS proxies are also supported:

import requests
import random

json_resp = requests.get('http://localhost:8899/api/v1/proxies?https=true').json()
proxy = random.choice(json_resp['proxies'])

requests.get('https://api.ipify.org', proxies={'https': 'https://{}:{}'.format(proxy['ip'], proxy['port'])})
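
A randomly chosen proxy may turn out to be dead by the time it is used, so in practice it can help to try a few before giving up. The following is a minimal sketch of that pattern, not something Scylla provides: fetch_with_retries is a hypothetical helper, and fetch is a callable you supply that performs the request through the given proxy URL.

```python
import random


def fetch_with_retries(proxies, fetch, max_tries=3):
    """Try up to max_tries distinct random proxies and return the first result.

    Hypothetical helper. proxies is a list of records with "ip" and "port"
    keys, as returned by the JSON API. fetch(proxy_url) is supplied by the
    caller and should raise an exception on failure.
    """
    candidates = random.sample(proxies, min(max_tries, len(proxies)))
    last_error = None
    for proxy in candidates:
        proxy_url = 'http://{}:{}'.format(proxy['ip'], proxy['port'])
        try:
            return fetch(proxy_url)
        except Exception as exc:  # broad on purpose: any failure means "try the next one"
            last_error = exc
    raise RuntimeError('all proxies failed') from last_error
```

With requests, fetch could be lambda proxy_url: requests.get('http://api.ipify.org', proxies={'http': proxy_url}, timeout=10), where the timeout value is an arbitrary choice for this example.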

With the forward proxy server

import requests

requests.get('http://api.ipify.org', proxies={'http': 'http://127.0.0.1:8081'})

System Design

Validation Policy

The validation policy for proxy IPs is described in validation_policy.py:

from datetime import datetime, timedelta

from scylla.database import ProxyIP


class ValidationPolicy(object):
    """
    ValidationPolicy will make decision about validating a proxy IP from the following aspects:
    1. Whether or not to validate the proxy
    2. Use http or https to validate the proxy

    After 3 attempts, the validator should make no further attempts within 24 hours of the proxy's creation.
    """
    proxy_ip: ProxyIP = None

    def __init__(self, proxy_ip: ProxyIP):
        """
        Constructor of ValidationPolicy
        :param proxy_ip: the ProxyIP instance to be validated
        """
        self.proxy_ip = proxy_ip

    def should_validate(self) -> bool:
        if self.proxy_ip.attempts == 0:
            return True
        elif self.proxy_ip.attempts < 3 \
                and datetime.now() - self.proxy_ip.created_at < timedelta(hours=24) \
                and not self.proxy_ip.is_valid:
            # If the proxy is created within 24 hours, the maximum attempt count is 3
            return True
        elif timedelta(hours=48) > datetime.now() - self.proxy_ip.created_at > timedelta(hours=24) \
                and self.proxy_ip.attempts < 6:
            # Between 24 and 48 hours after creation, the proxy will be validated up to 6 times in total
            return True
        elif datetime.now() - self.proxy_ip.created_at < timedelta(days=7) \
                and self.proxy_ip.attempts < 21 \
                and self.proxy_ip.is_valid:
            # After the first 48 hours, a valid proxy will be validated up to
            # 21 times in total (3 times a day on average) within 7 days of creation.
            return True
        # By default, return False
        return False

    def should_try_https(self) -> bool:
        if self.proxy_ip.is_valid and self.proxy_ip.attempts < 3 \
                and self.proxy_ip.https_attempts == 0:
            # Try https proxy for the 2nd and 3rd time if the proxy is valid
            return True

        return False
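
To make the branches of should_validate easier to follow, here is a standalone restatement of the same rules that runs without Scylla installed. It is a sketch for illustration only: the free function and its age parameter (time since the proxy was created) are not part of the project, and the conditions are copied from the class above.

```python
from datetime import timedelta


def should_validate(attempts: int, age: timedelta, is_valid: bool) -> bool:
    """Standalone restatement of ValidationPolicy.should_validate (illustrative)."""
    if attempts == 0:
        return True  # never validated yet
    if attempts < 3 and age < timedelta(hours=24) and not is_valid:
        return True  # first day: up to 3 attempts for a proxy not yet valid
    if timedelta(hours=24) < age < timedelta(hours=48) and attempts < 6:
        return True  # second day: up to 6 attempts in total
    if age < timedelta(days=7) and attempts < 21 and is_valid:
        return True  # first week: up to 21 attempts for a valid proxy
    return False


# A brand-new proxy is always validated
assert should_validate(0, timedelta(minutes=5), False)
# Three failed attempts on the first day: wait until the 24-hour mark
assert not should_validate(3, timedelta(hours=1), False)
# The same proxy becomes eligible again on the second day
assert should_validate(3, timedelta(hours=30), False)
# A valid proxy that has used all 21 attempts is no longer validated
assert not should_validate(21, timedelta(days=3), True)
```

Note that a proxy which is already valid on its first day falls through to the last rule, so it keeps being validated as long as it has fewer than 21 attempts.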

API Documentation

Please read Module Index.

Roadmap

Please see Projects.

Development and Contribution

git clone https://github.com/imWildCat/scylla.git
cd scylla

pip install -r requirements.txt

npm install # or `yarn install`
make build-assets

Testing

If you wish to run tests locally, the commands are shown below:

pip install -r tests/requirements-test.txt
pytest tests/

You are welcome to add more test cases to improve the robustness of this project.

Naming of This Project

The name Scylla comes from a set of memory chips in the American TV series Prison Break. The project is named in tribute to the series.

Donation

If you find this project useful, please consider making a donation.

Whatever the amount, your donation will inspire the author to keep developing new features! 🎉

Thank you!

You can donate in the following ways:

PayPal

PayPal Donation Official

Alipay or WeChat Pay

Alipay and WeChat Donation

License

Apache License 2.0. For more details, please read the LICENSE file.
