Chapter 11: Anti-Crawling and Strategies to Counter It
Once the volume of crawling reaches a certain scale, problems such as duplicate data and dead links become prominent, and the target site is also far more likely to notice the crawler and push back. How, then, do we deal with anti-crawling measures?
11.1 How Websites Detect Crawlers
Broadly speaking, websites use a few simple heuristics to spot crawler programs:
1) Unusually high access frequency from a single IP;
2) Unusually high traffic volume from a single IP;
3) Large amounts of simple, repetitive browsing behavior: pages are downloaded, but the follow-up JS and CSS requests a real browser would make never arrive;
4) Traps that only a crawler would hit, for example links hidden from human users via CSS.
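The first two signals can be approximated server-side with a per-IP sliding-window counter. A minimal sketch, where the class name and thresholds are illustrative rather than from the text:

```python
import time
from collections import defaultdict, deque

class RateWatcher:
    """Flag IPs whose request frequency exceeds a threshold
    within a sliding time window (illustrative thresholds)."""

    def __init__(self, max_requests=100, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, now=None):
        """Record one request; return True if the IP now looks like a bot."""
        now = time.time() if now is None else now
        q = self.hits[ip]
        q.append(now)
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_requests
```

The same idea, keyed on bytes transferred instead of request count, covers the traffic-volume signal.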
11.2 How Websites Block Crawlers
Websites generally rely on two simple strategies to fend off crawlers:
1. Heavy use of dynamic pages, which makes crawling much harder: the important data never appears in the static HTML, and even a crawler that embeds a browser environment to render the page pays a heavy price in load and crawl time. (Dynamic loading also happens to lighten the load on the server.)
2. Traffic-based rejection:
Enabling a bandwidth-limiting module to cap each IP's maximum connection count, maximum bandwidth, and so on.
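As a concrete example of such limits, nginx's stock `limit_conn`/`limit_req` modules can cap per-IP concurrency, request rate, and response bandwidth. The zone names and numbers below are illustrative:

```nginx
# Shared-memory zones keyed by client IP.
limit_conn_zone $binary_remote_addr zone=perip_conn:10m;
limit_req_zone  $binary_remote_addr zone=perip_req:10m rate=10r/s;

server {
    location / {
        limit_conn perip_conn 10;           # at most 10 concurrent connections per IP
        limit_req  zone=perip_req burst=20; # 10 req/s sustained, bursts up to 20
        limit_rate 50k;                     # cap response bandwidth at 50 KB/s
    }
}
```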
11.3 How a Crawler Can Tell It May Have Been Detected
If any of the following shows up during a crawl, be careful: the site may have spotted your crawler:
1. CAPTCHAs start appearing;
2. Unusual content-delivery delays;
3. Frequent HTTP 403, 404, 301, or 50x responses.
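These signals can be checked programmatically after each fetch. A minimal sketch of such a classifier; the status set and CAPTCHA markers are illustrative:

```python
# Classify a fetched response as "possibly blocked" based on the signals
# above: telltale status codes, or CAPTCHA markers in the page body.
BLOCK_STATUSES = {403, 404, 429, 500, 502, 503}
CAPTCHA_MARKERS = ("captcha", "verify you are human", "驗證碼")

def looks_blocked(status, body):
    """Return True if the response suggests the crawler was detected."""
    if status in BLOCK_STATUSES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```

A crawler would typically back off, rotate its identity, or alert an operator when this returns True.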
11.4 How Crawlers Counter Anti-Crawling
We can approach counter-measures from several angles:
1) A pool of User-Agent strings;
2) A pool of proxy servers;
3) Proper cookie management (e.g., via a CookieJar);
4) Attention to protocol details, which takes a good deal of hands-on experience: skip CSS and JS downloads when only data is needed; respect the nofollow attribute; check CSS display properties to avoid honeypot links hidden from users; probe for traps; send a plausible Referer header; and so on;
5) A distributed, multi-machine setup; crawl slowly, and run the crawler from IP subnets that already access the target site heavily, such as an education network;
6) When crawling in bulk by rules, combine and vary the rules;
7) Dealing with CAPTCHAs: machine learning and image recognition;
8) Comply with the Robots protocol (robots.txt) wherever possible.
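Items 1) through 3) can be combined using the standard library alone. A minimal Python 3 sketch, in which the User-Agent strings and proxy address are placeholders:

```python
import random
import urllib.request
from http.cookiejar import CookieJar

# Placeholder pools -- in practice these would be larger and refreshed.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]
PROXIES = ["106.39.179.236:80"]  # e.g., drawn from a list like Appendix A

def make_opener():
    """Build an opener that rotates User-Agent and proxy and keeps cookies."""
    proxy = random.choice(PROXIES)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy}),
        urllib.request.HTTPCookieProcessor(CookieJar()),
    )
    opener.addheaders = [("User-Agent", random.choice(USER_AGENTS))]
    return opener

# Usage (not run here): make_opener().open("http://example.com")
```

Building a fresh opener per session (or per N requests) is what turns the two pools into actual rotation.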
Summary and Future Work
These eleven chapters are intended mainly as a reference for junior and mid-level crawler engineers. Given the limits of my own ability and knowledge, this is as far as the summary goes for now. Crawling techniques, like Internet technology in general, change very quickly, so I will do my best to add new, practical material later as real engineering work demands.
Appendix A: 100 Collected Proxy Servers (Possibly Still Usable)
106.39.179.236:80
23.94.191.219:1080
121.41.175.199:80
122.183.139.98:8080
118.193.107.182:80
92.42.109.45:1080
128.199.77.93:8080
46.101.60.239:8118
185.106.121.98:1080
185.82.203.81:1080
112.114.93.27:8118
104.131.69.203:80
138.201.0.184:1080
46.101.46.174:8118
178.62.123.38:8118
217.23.15.193:1080
60.168.207.208:8010
139.59.170.110:8118
223.241.118.228:8010
123.192.114.113:80
103.37.95.110:8000
180.179.43.250:80
185.117.74.81:1080
116.199.2.196:80
118.193.107.119:80
128.199.77.93:8000
170.246.114.213:8080
104.243.47.146:1080
111.3.108.44:8118
124.42.7.103:80
39.134.161.18:80
146.185.156.221:8118
47.89.249.110:80
118.193.107.192:80
124.232.163.10:3128
223.19.105.206:80
46.166.168.243:1080
118.114.77.47:8080
182.253.205.85:8090
45.55.132.29:9999
58.251.227.238:8118
118.193.107.142:80
118.193.107.135:80
118.193.107.219:80
46.101.45.212:8118
114.249.45.176:8118
80.152.201.116:8080
94.177.254.86:80
197.155.158.22:80
196.200.173.83:80
212.237.10.45:8080
188.166.144.173:8118
210.71.198.230:8118
177.114.228.112:8080
218.50.2.102:8080
198.204.251.158:1080
188.166.204.221:8118
185.117.74.126:1080
106.39.179.244:80
39.134.161.14:8080
85.10.247.136:1080
46.166.168.245:1080
5.167.50.35:3129
118.178.227.171:80
122.96.59.102:82
52.174.89.111:80
103.25.173.237:808
121.232.145.168:9000
103.251.167.8:1080
46.101.26.217:8118
171.37.178.175:9797
103.251.166.18:1080
186.225.176.93:8080
121.232.147.132:9000
104.224.168.178:8888
47.90.2.253:8118
121.232.145.82:9000
118.193.107.36:80
58.56.128.84:9001
139.59.153.59:80
122.183.139.101:8080
163.172.184.226:8118
198.204.251.146:1080
213.133.100.195:1080
42.104.84.106:8080
117.2.64.109:8888
121.232.144.229:9000
156.67.219.61:8080
138.36.106.90:80
1.179.233.66:80
222.33.192.238:8118
138.197.224.12:8118
151.106.10.6:1080
134.35.250.204:8080
58.251.227.233:8118
52.221.40.19:80
222.73.68.144:8090
46.166.168.247:1080
192.99.222.207:80
1.23.160.212:8080
Appendix B: urllib in Python 2 vs Python 3
See: http://blog.csdn.net/whatday/article/details/54710403
Mapping of the urllib library from Python 2 to Python 3:
urllib.urlretrieve() urllib.request.urlretrieve()
urllib.urlcleanup() urllib.request.urlcleanup()
urllib.quote() urllib.parse.quote()
urllib.quote_plus() urllib.parse.quote_plus()
urllib.unquote() urllib.parse.unquote()
urllib.unquote_plus() urllib.parse.unquote_plus()
urllib.urlencode() urllib.parse.urlencode()
urllib.pathname2url() urllib.request.pathname2url()
urllib.url2pathname() urllib.request.url2pathname()
urllib.getproxies() urllib.request.getproxies()
urllib.URLopener urllib.request.URLopener
urllib.FancyURLopener urllib.request.FancyURLopener
urllib.ContentTooShortError urllib.error.ContentTooShortError
urllib2.urlopen() urllib.request.urlopen()
urllib2.install_opener() urllib.request.install_opener()
urllib2.build_opener() urllib.request.build_opener()
urllib2.URLError urllib.error.URLError
urllib2.HTTPError urllib.error.HTTPError
urllib2.Request urllib.request.Request
urllib2.OpenerDirector urllib.request.OpenerDirector
urllib2.BaseHandler urllib.request.BaseHandler
urllib2.HTTPDefaultErrorHandler urllib.request.HTTPDefaultErrorHandler
urllib2.HTTPRedirectHandler urllib.request.HTTPRedirectHandler
urllib2.HTTPCookieProcessor urllib.request.HTTPCookieProcessor
urllib2.ProxyHandler urllib.request.ProxyHandler
urllib2.HTTPPasswordMgr urllib.request.HTTPPasswordMgr
urllib2.HTTPPasswordMgrWithDefaultRealm urllib.request.HTTPPasswordMgrWithDefaultRealm
urllib2.AbstractBasicAuthHandler urllib.request.AbstractBasicAuthHandler
urllib2.HTTPBasicAuthHandler urllib.request.HTTPBasicAuthHandler
urllib2.ProxyBasicAuthHandler urllib.request.ProxyBasicAuthHandler
urllib2.AbstractDigestAuthHandler urllib.request.AbstractDigestAuthHandler
urllib2.HTTPDigestAuthHandler urllib.request.HTTPDigestAuthHandler
urllib2.ProxyDigestAuthHandler urllib.request.ProxyDigestAuthHandler
urllib2.HTTPHandler urllib.request.HTTPHandler
urllib2.HTTPSHandler urllib.request.HTTPSHandler
urllib2.FileHandler urllib.request.FileHandler
urllib2.FTPHandler urllib.request.FTPHandler
urllib2.CacheFTPHandler urllib.request.CacheFTPHandler
urllib2.UnknownHandler urllib.request.UnknownHandler
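For instance, the quote/urlencode entries from the table now live in urllib.parse, as this short Python 3 snippet shows:

```python
from urllib.parse import quote, unquote, urlencode

# Python 2's urllib.quote()/urlencode() are urllib.parse functions in Python 3.
encoded = quote("search term")           # percent-encode a path segment
query = urlencode({"q": "search term"})  # build a query string

print(encoded)  # search%20term
print(query)    # q=search+term
assert unquote(encoded) == "search term"
```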