Having a proxy pool helps our work a great deal. After some research, a small proxy pool emerged: many features were cut, keeping only the essentials. Since the storage and fetching modules are relatively simple, they were merged into a single module.
The whole module's code is pasted below. There are many sites that publish free proxies; only one is scraped here.
<code>
import time

import pymongo
import requests
from lxml import etree

# Assumed page pattern for the paginated free-proxy list; the original code
# referenced an undefined `baseurl`, so this URL template is a reconstruction.
baseurl = 'https://www.xicidaili.com/nn/{}'


class CAT_IP():
    def __init__(self):
        self.client = pymongo.MongoClient(host='localhost', port=27017)
        self.db = self.client['proxy']
        self.session = requests.Session()
        # Site root, assumed: the original used an undefined `self.url` here.
        self.url = 'https://www.xicidaili.com/'
        self.headers = {
            'Cookie': '_free_proxy_session=BAh7B0kiD3Nlc3Npb25faWQGOgZFVEkiJWYwNzA1YmIzM2QzNTU0NGNjNmMyNWI3NDk1M2FlNmE5BjsAVEkiEF9jc3JmX3Rva2VuBjsARkkiMTQ5K3ZlRkx2dGs3ZmZMZTBjd1VLRTRHaUFCVDdKQTkxOTFIU3BYekYrdmc9BjsARg%3D%3D--8a2932ebb9c868977ffbc071eab471ef4144a1c6; Hm_lvt_0cf76c77469e965d2957f0553e6ecf59=1545528007,1545529206,1545554081; Hm_lpvt_0cf76c77469e965d2957f0553e6ecf59=1545554192',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
            'Host': 'www.xicidaili.com'
        }
        # Warm up the session so later requests carry valid cookies.
        self.session.get(url=self.url, headers=self.headers)

    def the_xici(self):
        # Crawl the first three pages of the free-proxy list.
        for i in range(3):
            time.sleep(1)
            the_url = baseurl.format(i + 1)
            re = requests.get(url=the_url, headers=self.headers)
            re = re.content.decode('utf-8')
            html = etree.HTML(text=re)
            targets = html.xpath('//table[@id="ip_list"]//tr')
            del targets[0]  # drop the table's header row
            for target in targets:
                target_ip = ''.join(target.xpath('./td[2]/text()'))
                target_port = ''.join(target.xpath('./td[3]/text()'))
                result = '{}:{}'.format(target_ip, target_port)
                print('Fetched proxy {}'.format(result))
                yield {'dl': result}

    def save_all_to_waitingArea(self, lists):
        # Replace the whole wait_area collection with the freshly crawled batch.
        collection = self.db['wait_area']
        collection.remove({})
        collection.insert_many(lists)
        print('Saved all proxies successfully')


if __name__ == '__main__':
    lists = CAT_IP().the_xici()
    CAT_IP().save_all_to_waitingArea(lists)
</code>
There are two methods in total: one crawls the free proxies, and the other stores all of them into the database's 'wait_area' collection.
Some parameters are initialized in the __init__ method.
The second module is the checking module.
The code is as follows:
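The XPath extraction inside the_xici can be exercised offline. The sketch below uses a hypothetical table fragment that mimics the layout of the page's `ip_list` table (header row first, IP in the second cell, port in the third); the sample addresses are made up:

```python
from lxml import etree

# Hypothetical stand-in for the proxy-list page (same table layout).
SAMPLE = '''
<table id="ip_list">
  <tr><th>flag</th><th>IP</th><th>Port</th></tr>
  <tr><td></td><td>1.2.3.4</td><td>8080</td></tr>
  <tr><td></td><td>5.6.7.8</td><td>3128</td></tr>
</table>
'''

html = etree.HTML(SAMPLE)
rows = html.xpath('//table[@id="ip_list"]//tr')
del rows[0]  # drop the header row, exactly as the_xici does

proxies = []
for row in rows:
    ip = ''.join(row.xpath('./td[2]/text()'))
    port = ''.join(row.xpath('./td[3]/text()'))
    proxies.append('{}:{}'.format(ip, port))

print(proxies)  # ['1.2.3.4:8080', '5.6.7.8:3128']
```

Joining the `text()` results with `''.join(...)` yields an empty string instead of raising when a cell is missing, which is why the crawler uses that pattern.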
<code>
import threading

import pymongo
import requests


class CHECK_PROXY():
    def __init__(self):
        self.client = pymongo.MongoClient(host='localhost', port=27017)
        self.db = self.client['proxy']
        self.session = requests.Session()
        self.target_url = 'https://mp.csdn.net/mdeditor#'

    def save_one_to_useArea(self, proxy):
        # Only insert the proxy if it is not already in use_area.
        collection = self.db['use_area']
        is_live = collection.find_one({'dl': proxy['dl']})
        if is_live is None:
            collection.insert(proxy)
        else:
            print('Already exists: {}'.format(proxy))

    def get_one_proxy(self):
        # Pop one proxy from the waiting area.
        collection = self.db['wait_area']
        proxy = collection.find_one()
        collection.remove(proxy)
        return proxy

    def test_IP(self, IP):
        proxies = {
            'http': 'http://{}'.format(IP),
            'https': 'http://{}'.format(IP),
        }
        try:
            with self.session.get(url=self.target_url, proxies=proxies) as response:
                if response.status_code == 200:
                    print('Proxy {} passed the test'.format(IP))
                    return True
                return False
        except Exception:
            print('Proxy {} failed the test'.format(IP))
            return False

    def check_count(self):
        collection = self.db['wait_area']
        return collection.count()

    def check_proxy(self):
        # The original called get_One_proxy(), which raises an AttributeError;
        # the method name is get_one_proxy.
        proxy = self.get_one_proxy()
        if self.test_IP(proxy['dl']):
            self.save_one_to_useArea(proxy=proxy)
        # Recurse until the waiting area is empty.
        if self.check_count() > 0:
            self.check_proxy()


if __name__ == '__main__':
    for t in range(7):
        thread = threading.Thread(target=CHECK_PROXY().check_proxy, args=())
        thread.start()
</code>
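Note that with 7 threads, the find_one/remove pair in get_one_proxy is not atomic, so two threads can pop the same document (pymongo's `find_one_and_delete` would make the pop atomic). The pop-and-test pattern can be sketched safely with an in-process `queue.Queue` standing in for wait_area; the addresses are made up and every proxy is treated as passing:

```python
import queue
import threading

# Hypothetical stand-in: wait_area modeled as a thread-safe queue.
wait_area = queue.Queue()
for addr in ['1.2.3.4:8080', '5.6.7.8:3128', '9.9.9.9:80']:
    wait_area.put({'dl': addr})

use_area = []
use_lock = threading.Lock()

def worker():
    while True:
        try:
            proxy = wait_area.get_nowait()  # atomic pop: no duplicates
        except queue.Empty:
            return
        # A real worker would call test_IP here; this sketch accepts everything.
        with use_lock:
            use_area.append(proxy)

threads = [threading.Thread(target=worker) for _ in range(7)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(use_area))  # 3 -- each proxy handled exactly once
```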
Likewise, some parameters are initialized in the __init__ section. save_one_to_useArea(self, proxy) puts a proxy into the database's 'use_area' collection, which holds the proxies that passed the test.
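The duplicate check in save_one_to_useArea can be illustrated without a database. A minimal sketch, using a plain dict keyed on 'dl' as a stand-in for the use_area collection (the address is made up):

```python
# In-memory stand-in for the use_area collection, keyed on the 'dl' field.
use_area = {}

def save_one(proxy):
    # Mirrors: find_one({'dl': ...}) is None -> insert, else skip.
    if proxy['dl'] not in use_area:
        use_area[proxy['dl']] = proxy
        return True
    print('Already exists: {}'.format(proxy))
    return False

save_one({'dl': '1.2.3.4:8080'})
saved_again = save_one({'dl': '1.2.3.4:8080'})
print(saved_again)  # False -- the duplicate is skipped
```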