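Scylla Documentation¶
Scylla is a high-quality, free proxy IP pool tool. It supports Python 3.6 only. Features:
- Automatic crawling and validation of proxy IPs
- Easy-to-use JSON API
- Simple but attractive web UI built with TypeScript and React (for example, the geographic distribution of proxies)
- Get started with as little as one command
- Straightforward programmatic API (to be added in version 1.1)
- Integration with Scrapy and requests in as little as one line of code
- Headless browser crawling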
Quick Start¶
Installation¶
Install with Docker (Recommended)¶
docker run -d -p 8899:8899 -p 8081:8081 -v /var/www/scylla:/var/www/scylla --name scylla wildcat/scylla:latest
Install Directly with pip¶
pip install scylla
scylla --help
scylla # Run the crawler and the web server
Install from Source¶
git clone https://github.com/imWildCat/scylla.git
cd scylla
pip install -r requirements.txt
npm install # or yarn install
make build-assets
python -m scylla
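Usage¶
The examples below assume the service is running locally (localhost) on port 8899.
Note: the first time you run the project, you may need to wait 1 to 2 minutes while it crawls an initial batch of proxy IPs.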
JSON API¶
Proxy IP List¶
http://localhost:8899/api/v1/proxies
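Optional URL parameters:
Parameter | Default | Description |
---|---|---|
page | 1 | Page number |
limit | 20 | Number of proxy IPs per page |
anonymous | any | Filter by anonymity. true returns only anonymous proxies; false returns only transparent proxies. |
https | any | Filter by protocol. true returns only HTTPS proxies; false returns only HTTP proxies. |
countries | none | Return only proxies from specific countries, for example US, or several countries: US,GB |
Sample response: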
{
"proxies": [{
"id": 599,
"ip": "91.229.222.163",
"port": 53281,
"is_valid": true,
"created_at": 1527590947,
"updated_at": 1527593751,
"latency": 23.0,
"stability": 0.1,
"is_anonymous": true,
"is_https": true,
"attempts": 1,
"https_attempts": 0,
"location": "54.0451,-0.8053",
"organization": "AS57099 Boundless Networks Limited",
"region": "England",
"country": "GB",
"city": "Malton"
}, {
"id": 75,
"ip": "75.151.213.85",
"port": 8080,
"is_valid": true,
"created_at": 1527590676,
"updated_at": 1527593702,
"latency": 268.0,
"stability": 0.3,
"is_anonymous": true,
"is_https": true,
"attempts": 1,
"https_attempts": 0,
"location": "32.3706,-90.1755",
"organization": "AS7922 Comcast Cable Communications, LLC",
"region": "Mississippi",
"country": "US",
"city": "Jackson"
},
...
],
"count": 1025,
"per_page": 20,
"page": 1,
"total_page": 52
}
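The optional parameters above can be combined into a single query string. The following is a minimal sketch using only the Python standard library. The helper names `build_proxies_url` and `total_pages` are hypothetical and not part of Scylla, and the page-count formula is an assumption inferred from the sample response (1025 proxies at 20 per page gives 52 pages).

```python
import math
from urllib.parse import urlencode

# Endpoint taken from this document; adjust host and port for your deployment.
BASE_URL = "http://localhost:8899/api/v1/proxies"


def build_proxies_url(page=1, limit=20, anonymous=None, https=None, countries=None):
    """Build a proxy-list URL. Filters left as None are omitted (meaning "any")."""
    params = {"page": page, "limit": limit}
    if anonymous is not None:
        params["anonymous"] = "true" if anonymous else "false"
    if https is not None:
        params["https"] = "true" if https else "false"
    if countries:
        # Multiple countries are comma-separated, e.g. US,GB
        params["countries"] = ",".join(countries)
    # safe="," keeps the comma readable instead of percent-encoding it
    return BASE_URL + "?" + urlencode(params, safe=",")


def total_pages(count, per_page):
    """Assumed relationship between count, per_page and total_page."""
    return math.ceil(count / per_page)


url = build_proxies_url(https=True, countries=["US", "GB"])
print(url)
print(total_pages(1025, 20))
```

The resulting URL can be passed to any HTTP client, for example `requests.get(url).json()`.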
System Statistics¶
http://localhost:8899/api/v1/stats
Sample response:
{
"median": 181.2566407083,
"valid_count": 1780,
"total_count": 9528,
"mean": 174.3290085201
}
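One simple use of this endpoint is to check what share of the stored proxies is currently valid. The sketch below parses the sample response shown above. The helper name `valid_ratio` is hypothetical, and this document does not state what `mean` and `median` measure, so they are left uninterpreted.

```python
import json

# Sample response copied from this document.
SAMPLE_STATS = """
{
  "median": 181.2566407083,
  "valid_count": 1780,
  "total_count": 9528,
  "mean": 174.3290085201
}
"""


def valid_ratio(stats):
    """Fraction of stored proxies that are valid; 0.0 when the pool is empty."""
    total = stats["total_count"]
    return stats["valid_count"] / total if total else 0.0


stats = json.loads(SAMPLE_STATS)
ratio = valid_ratio(stats)
print("{:.1%} of stored proxies are valid".format(ratio))
```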
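HTTP Forward Proxy Server¶
By default, Scylla starts an HTTP forward proxy server on port 8081.
This server selects a recently updated proxy from the database and uses it as the forward proxy.
Each time an HTTP request is made, the proxy server picks a proxy at random.
Note: HTTPS requests are not currently supported.
A curl example using this proxy server: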
curl http://api.ipify.org -x http://127.0.0.1:8081
You can also use this feature with requests:
requests.get('http://api.ipify.org', proxies={'http': 'http://127.0.0.1:8081'})
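Because the forward proxy does not handle HTTPS requests, it can help to make that restriction explicit in client code. The sketch below is one possible way to do so, using only the standard library. `proxies_for` is a hypothetical helper, not part of Scylla or requests, and the proxy address is the default from this document.

```python
from urllib.parse import urlsplit

# Default forward proxy address from this document.
FORWARD_PROXY = "http://127.0.0.1:8081"


def proxies_for(url):
    """Return a requests-style proxies mapping for `url`.

    Only plain HTTP URLs are routed through the Scylla forward proxy.
    For any other scheme the mapping is empty, so no Scylla proxy is applied.
    """
    scheme = urlsplit(url).scheme.lower()
    if scheme == "http":
        return {"http": FORWARD_PROXY}
    return {}


print(proxies_for("http://api.ipify.org"))
print(proxies_for("https://api.ipify.org"))
```

The mapping can then be passed as `requests.get(url, proxies=proxies_for(url))`. Note that requests already keys proxies by URL scheme, so the helper mainly documents the intent.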
Other Examples¶
Examples with Requests¶
Requests is a mature, easy-to-use HTTP library for Python, and using Scylla with it is straightforward.
Calling the JSON API¶
import requests
import random
json_resp = requests.get('http://localhost:8899/api/v1/proxies').json()
proxy = random.choice(json_resp['proxies'])
requests.get('http://api.ipify.org', proxies={'http': 'http://{}:{}'.format(proxy['ip'], proxy['port'])})
HTTPS proxies are also supported:
import requests
import random
json_resp = requests.get('http://localhost:8899/api/v1/proxies?https=true').json()
proxy = random.choice(json_resp['proxies'])
requests.get('https://api.ipify.org', proxies={'https': 'https://{}:{}'.format(proxy['ip'], proxy['port'])})
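Instead of picking a proxy at random, a client could prefer the valid proxy with the lowest reported latency. The sketch below works on records shaped like the sample response in this document. The helper names `fastest_proxy` and `to_requests_proxies` are hypothetical, and treating a lower `latency` value as better is an assumption.

```python
# Two records trimmed from the sample response in this document.
SAMPLE_PROXIES = [
    {"ip": "91.229.222.163", "port": 53281, "is_valid": True, "latency": 23.0},
    {"ip": "75.151.213.85", "port": 8080, "is_valid": True, "latency": 268.0},
]


def fastest_proxy(proxies):
    """Return the valid proxy with the lowest latency, or None if none is valid."""
    valid = [p for p in proxies if p["is_valid"]]
    if not valid:
        return None
    return min(valid, key=lambda p: p["latency"])


def to_requests_proxies(proxy, scheme="http"):
    """Build the proxies mapping in the same form as the examples above."""
    url = "{}://{}:{}".format(scheme, proxy["ip"], proxy["port"])
    return {scheme: url}


best = fastest_proxy(SAMPLE_PROXIES)
print(to_requests_proxies(best))
```

The returned mapping can be passed directly as the `proxies` argument of `requests.get`.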
Using the Forward Proxy Server¶
requests.get('http://api.ipify.org', proxies={'http': 'http://127.0.0.1:8081'})
System Design¶
Validation Policy¶
The validation policy for proxy IPs is defined in validation_policy.py:
from datetime import datetime, timedelta

from scylla.database import ProxyIP


class ValidationPolicy(object):
    """
    ValidationPolicy will make decision about validating a proxy IP from the following aspects:
    1. Whether or not to validate the proxy
    2. Use http or https to validate the proxy

    After 3 attempts, the validator should try no more attempts in 24 hours after its creation.
    """
    proxy_ip: ProxyIP = None

    def __init__(self, proxy_ip: ProxyIP):
        """
        Constructor of ValidationPolicy
        :param proxy_ip: the ProxyIP instance to be validated
        """
        self.proxy_ip = proxy_ip

    def should_validate(self) -> bool:
        if self.proxy_ip.attempts == 0:
            return True
        elif self.proxy_ip.attempts < 3 \
                and datetime.now() - self.proxy_ip.created_at < timedelta(hours=24) \
                and not self.proxy_ip.is_valid:
            # If the proxy is created within 24 hours, the maximum attempt count is 3
            return True
        elif timedelta(hours=48) > datetime.now() - self.proxy_ip.created_at > timedelta(hours=24) \
                and self.proxy_ip.attempts < 6:
            # The proxy will be validated up to 6 times with in 48 hours after 24 hours
            return True
        elif datetime.now() - self.proxy_ip.created_at < timedelta(days=7) \
                and self.proxy_ip.attempts < 21 \
                and self.proxy_ip.is_valid:
            # After 48 hours the proxy is created, the proxy will be validated up to
            # 21 times (3 times a day on average) if it is valid within 7 days.
            return True
        # By default, return False
        return False

    def should_try_https(self) -> bool:
        if self.proxy_ip.is_valid and self.proxy_ip.attempts < 3 \
                and self.proxy_ip.https_attempts == 0:
            # Try https proxy for the 2nd and 3rd time if the proxy is valid
            return True
        return False
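To see how the branches of `should_validate` play out, the sketch below re-implements the same conditions as a standalone function and runs them against a few hand-made cases. It is an illustration only: `FakeProxy` is a hypothetical stand-in for `scylla.database.ProxyIP` with just the fields the policy reads, and the current time is passed in as `now` so the results are deterministic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FakeProxy:
    """Hypothetical stand-in for ProxyIP (not part of Scylla)."""
    attempts: int
    is_valid: bool
    created_at: datetime


def should_validate(proxy, now):
    """Same branch order as ValidationPolicy.should_validate, with `now` injected."""
    age = now - proxy.created_at
    if proxy.attempts == 0:
        return True
    if proxy.attempts < 3 and age < timedelta(hours=24) and not proxy.is_valid:
        return True
    if timedelta(hours=24) < age < timedelta(hours=48) and proxy.attempts < 6:
        return True
    if age < timedelta(days=7) and proxy.attempts < 21 and proxy.is_valid:
        return True
    return False


now = datetime(2018, 6, 1, 12, 0, 0)  # arbitrary fixed reference time
cases = {
    "never tried": FakeProxy(0, False, now - timedelta(minutes=5)),
    "failed twice, 2 hours old": FakeProxy(2, False, now - timedelta(hours=2)),
    "failed 3 times, 2 hours old": FakeProxy(3, False, now - timedelta(hours=2)),
    "failed 4 times, 30 hours old": FakeProxy(4, False, now - timedelta(hours=30)),
    "valid, 10 attempts, 3 days old": FakeProxy(10, True, now - timedelta(days=3)),
    "valid, 10 attempts, 8 days old": FakeProxy(10, True, now - timedelta(days=8)),
}
results = {name: should_validate(proxy, now) for name, proxy in cases.items()}
for name, decision in results.items():
    print("{}: {}".format(name, decision))
```

In these cases, an invalid proxy stops being retried after its third attempt on the first day, gets further attempts between 24 and 48 hours, and a valid proxy keeps being revalidated until it is 7 days old or reaches 21 attempts.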
Development and Contributing¶
git clone https://github.com/imWildCat/scylla.git
cd scylla
pip install -r requirements.txt
npm install # or `yarn install`
make build-assets