Python Scrapy Framework, Day 02

Chapter 11: Anti-Crawling Measures and Strategies for Countering Them

Once the volume of crawled data reaches a certain scale, duplicate data and dead links start to surface, and the target sites are also far more likely to notice the crawler and push back. How do we deal with these anti-crawling measures?

11.1 How websites detect crawlers

In general, websites rely on a few simple heuristics to spot crawler programs:

1) An abnormally high request frequency from a single IP;

2) Abnormal traffic volume from a single IP;

3) Large amounts of simple, repetitive browsing behavior that only downloads the HTML page, with no follow-up JS or CSS requests;

4) Traps set for crawlers, such as links hidden from users via CSS that only a crawler would ever visit.

11.2 How websites fight back against crawlers

In general, websites adopt two simple strategies to deter crawlers:

1. Heavy use of dynamically rendered pages. This makes crawling harder and keeps the important data out of reach; even if the crawler embeds a web rendering environment (a built-in browser), its workload and crawl time increase significantly. (Dynamic loading also greatly reduces the load on the server.)

2. Traffic-based rejection:

Enable a bandwidth-limiting module to cap the maximum number of connections per IP, the maximum bandwidth, and so on.

11.3 How a crawler can tell it may have been identified

If any of the following occurs during a crawl, be careful: your crawler may have been detected by the site:

1. CAPTCHAs start to appear;

2. Unusual content delivery delays;

3. Frequent responses with HTTP 403, 404, 301, or 50x errors (a small monitoring sketch follows).
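As a minimal sketch of watching for these warning signs, a Scrapy downloader middleware can count suspicious status codes and log a warning as they pile up. The module and class names below are my own illustrative choices, not something from the original text:

# blockdetect.py -- hypothetical module and class names, not part of Scrapy itself.
class BlockSignalMiddleware:
    """Downloader middleware that treats frequent 403/404/301/50x responses
    as a hint that the site may have flagged the crawler."""

    SUSPICIOUS = {403, 404, 301, 500, 502, 503}

    def __init__(self):
        self.hits = 0

    def process_response(self, request, response, spider):
        if response.status in self.SUSPICIOUS:
            self.hits += 1
            spider.logger.warning(
                "Suspicious status %s (%d so far) for %s",
                response.status, self.hits, request.url,
            )
        return response  # always hand the response back unchanged

It would be enabled through the DOWNLOADER_MIDDLEWARES setting; a real implementation would probably also slow down or rotate proxies once the count passes some threshold.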

11.4 Crawler strategies for countering anti-crawling measures

We can approach anti-crawling countermeasures from the following angles:

1) A User-Agent pool (a middleware sketch follows this list);

2) A proxy server pool (Appendix A collects some candidate proxies);

3) Management of a CookieJar and related session state;

4) Attention to protocol details, which largely comes down to accumulated practical experience, for example:

not requesting CSS, JS, and similar assets while scraping data;

the nofollow attribute; the CSS display property; probing for traps (a link-filter sketch follows this list);

validating the Referer (the referring URL), etc.;

5) A distributed, multi-machine strategy; crawl slowly, and place the crawler inside an IP subnet that already accesses the target site heavily, such as an education network (see the settings sketch below);

6) When crawling in bulk with rules, combine and vary the rules;

7) Defeating CAPTCHAs with machine learning and image recognition;

8) Following the Robots protocol (robots.txt) as much as possible.
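To make items 1) and 2) concrete, here is a minimal sketch of two Scrapy downloader middlewares that pick a random User-Agent and a random proxy per request. The module name, the User-Agent strings, and the proxy addresses (sample entries from Appendix A, with no guarantee they still work) are illustrative assumptions:

# middlewares.py -- illustrative module; names are not from the original text.
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

PROXIES = [
    "http://106.39.179.236:80",    # sample addresses from Appendix A
    "http://128.199.77.93:8080",
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Rotate the User-Agent so requests do not all share one fingerprint.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)

class RandomProxyMiddleware:
    def process_request(self, request, spider):
        # Route each request through a randomly chosen proxy.
        request.meta["proxy"] = random.choice(PROXIES)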
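For the trap probing mentioned in item 4), one crude approach is to skip links that are hidden with inline CSS or marked nofollow before following them. The function name and selector logic below are illustrative only; real pages need more robust checks (external stylesheets, nested hidden containers, and so on):

def visible_links(response):
    # Yield only hrefs that a normal user could plausibly click.
    for a in response.css("a"):
        style = (a.attrib.get("style") or "").replace(" ", "").lower()
        rel = (a.attrib.get("rel") or "").lower()
        if "display:none" in style or "nofollow" in rel:
            continue  # likely a honeypot link or one the site excludes
        href = a.attrib.get("href")
        if href:
            yield response.urljoin(href)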
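Items 3), 5), and 8) mostly map onto Scrapy settings. The values below are a sketch of plausible defaults rather than recommendations, and the middleware paths assume a hypothetical project named myproject:

# settings.py excerpt (illustrative values)
ROBOTSTXT_OBEY = True                 # item 8: respect robots.txt where possible
DOWNLOAD_DELAY = 2                    # item 5: crawl slowly
AUTOTHROTTLE_ENABLED = True           # adapt the delay to observed latency
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # avoid hammering a single host
COOKIES_ENABLED = True                # item 3: let Scrapy manage the cookie jar
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
    "myproject.middlewares.RandomProxyMiddleware": 410,
}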

Summary and Further Work

These eleven chapters are intended mainly as reference material for junior and intermediate crawler engineers. My own ability and knowledge are limited, so this is as far as the summary goes for now; crawling knowledge and techniques, like the rest of the Internet, change very quickly, and I will do my best to add new, practical material later as real projects require.

Appendix A: 100 Collected Proxy Servers That May Be Usable

106.39.179.236:80

23.94.191.219:1080

121.41.175.199:80

122.183.139.98:8080

118.193.107.182:80

92.42.109.45:1080

128.199.77.93:8080

46.101.60.239:8118

185.106.121.98:1080

185.82.203.81:1080

112.114.93.27:8118

104.131.69.203:80

138.201.0.184:1080

46.101.46.174:8118

178.62.123.38:8118

217.23.15.193:1080

60.168.207.208:8010

139.59.170.110:8118

223.241.118.228:8010

123.192.114.113:80

103.37.95.110:8000

180.179.43.250:80

185.117.74.81:1080

116.199.2.196:80

118.193.107.119:80

128.199.77.93:8000

170.246.114.213:8080

104.243.47.146:1080

111.3.108.44:8118

124.42.7.103:80

39.134.161.18:80

146.185.156.221:8118

47.89.249.110:80

118.193.107.192:80

124.232.163.10:3128

223.19.105.206:80

46.166.168.243:1080

118.114.77.47:8080

182.253.205.85:8090

45.55.132.29:9999

58.251.227.238:8118

118.193.107.142:80

118.193.107.135:80

118.193.107.219:80

46.101.45.212:8118

114.249.45.176:8118

80.152.201.116:8080

94.177.254.86:80

197.155.158.22:80

196.200.173.83:80

212.237.10.45:8080

188.166.144.173:8118

210.71.198.230:8118

177.114.228.112:8080

218.50.2.102:8080

198.204.251.158:1080

188.166.204.221:8118

185.117.74.126:1080

106.39.179.244:80

39.134.161.14:8080

85.10.247.136:1080

46.166.168.245:1080

5.167.50.35:3129

118.178.227.171:80

122.96.59.102:82

52.174.89.111:80

103.25.173.237:808

121.232.145.168:9000

103.251.167.8:1080

46.101.26.217:8118

171.37.178.175:9797

103.251.166.18:1080

186.225.176.93:8080

121.232.147.132:9000

104.224.168.178:8888

47.90.2.253:8118

121.232.145.82:9000

118.193.107.36:80

58.56.128.84:9001

139.59.153.59:80

122.183.139.101:8080

163.172.184.226:8118

198.204.251.146:1080

213.133.100.195:1080

42.104.84.106:8080

117.2.64.109:8888

121.232.144.229:9000

156.67.219.61:8080

138.36.106.90:80

1.179.233.66:80

222.33.192.238:8118

138.197.224.12:8118

151.106.10.6:1080

134.35.250.204:8080

58.251.227.233:8118

52.221.40.19:80

222.73.68.144:8090

46.166.168.247:1080

192.99.222.207:80

1.23.160.212:8080

Appendix B: Python 2 vs Python 3 urllib Mapping

Reference: http://blog.csdn.net/whatday/article/details/54710403

Python 2 to Python 3 urllib mapping:

urllib.urlretrieve() urllib.request.urlretrieve()

urllib.urlcleanup() urllib.request.urlcleanup()

urllib.quote() urllib.parse.quote()

urllib.quote_plus() urllib.parse.quote_plus()

urllib.unquote() urllib.parse.unquote()

urllib.unquote_plus() urllib.parse.unquote_plus()

urllib.urlencode() urllib.parse.urlencode()

urllib.pathname2url() urllib.request.pathname2url()

urllib.url2pathname() urllib.request.url2pathname()

urllib.getproxies() urllib.request.getproxies()

urllib.URLopener urllib.request.URLopener

urllib.FancyURLopener urllib.request.FancyURLopener

urllib.ContentTooShortError urllib.error.ContentTooShortError

urllib2.urlopen() urllib.request.urlopen()

urllib2.install_opener() urllib.request.install_opener()

urllib2.build_opener() urllib.request.build_opener()

urllib2.URLError urllib.error.URLError

urllib2.HTTPError urllib.error.HTTPError

urllib2.Request urllib.request.Request

urllib2.OpenerDirector urllib.request.OpenerDirector

urllib2.BaseHandler urllib.request.BaseHandler

urllib2.HTTPDefaultErrorHandler urllib.request.HTTPDefaultErrorHandler

urllib2.HTTPRedirectHandler urllib.request.HTTPRedirectHandler

urllib2.HTTPCookieProcessor urllib.request.HTTPCookieProcessor

urllib2.ProxyHandler urllib.request.ProxyHandler

urllib2.HTTPPasswordMgr urllib.request.HTTPPasswordMgr

urllib2.HTTPPasswordMgrWithDefaultRealm urllib.request.HTTPPasswordMgrWithDefaultRealm

urllib2.AbstractBasicAuthHandler urllib.request.AbstractBasicAuthHandler

urllib2.HTTPBasicAuthHandler urllib.request.HTTPBasicAuthHandler

urllib2.ProxyBasicAuthHandler urllib.request.ProxyBasicAuthHandler

urllib2.AbstractDigestAuthHandler urllib.request.AbstractDigestAuthHandler

urllib2.HTTPDigestAuthHandler urllib.request.HTTPDigestAuthHandler

urllib2.ProxyDigestAuthHandler urllib.request.ProxyDigestAuthHandler

urllib2.HTTPHandler urllib.request.HTTPHandler

urllib2.HTTPSHandler urllib.request.HTTPSHandler

urllib2.FileHandler urllib.request.FileHandler

urllib2.FTPHandler urllib.request.FTPHandler

urllib2.CacheFTPHandler urllib.request.CacheFTPHandler

urllib2.UnknownHandler urllib.request.UnknownHandler
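As a quick illustration of the mapping, here is the same request written for Python 2 (in comments) and for Python 3; the URL and query string are placeholders:

# Python 2:
#   import urllib, urllib2
#   params = urllib.urlencode({"q": "scrapy"})
#   html = urllib2.urlopen("http://example.com/?" + params).read()

# Python 3, following the table above:
from urllib.request import urlopen      # urllib2.urlopen() -> urllib.request.urlopen()
from urllib.parse import urlencode      # urllib.urlencode() -> urllib.parse.urlencode()

params = urlencode({"q": "scrapy"})
with urlopen("http://example.com/?" + params) as resp:
    html = resp.read()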
