Web Scraping: The Requests Library

I. Introduction

Python ships with a built-in module called urllib for accessing network resources, but it lacks a number of practical features, which makes it cumbersome to use. Later a third-party module called "Requests" appeared. Requests inherits all the features of urllib2: it supports HTTP keep-alive and connection pooling, session persistence with cookies, file uploads, automatic detection of response encodings, and automatic encoding of internationalized URLs and POST data. In short, the requests module is far more capable than urllib!

Requests can simulate browser requests and is much more convenient than the urllib module used before, because requests is essentially a wrapper built on top of urllib3.

1. Installation

Install it with: pip3 install requests

2. The various request methods

>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})
>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')
Before diving into requests, it helps to be familiar with the HTTP protocol:
https://www.cnblogs.com/kermitjam/articles/9692568.html

II. GET Requests

1. Basic request

import requests
response = requests.get('https://www.cnblogs.com/kermitjam/')
print(response.text)

2. GET requests with parameters -> params

from urllib.parse import urlencode
import requests
# the value of q is the Chinese phrase 墨菲定律 (Murphy's Law)
response1 = requests.get('https://list.tmall.com/search_product.htm?q=%C4%AB%B7%C6%B6%A8%C2%C9')
print(response1.text)
# Character encoding turns the Chinese into opaque escape sequences, so we need a cleaner way to build the URL
url = 'https://list.tmall.com/search_product.htm?' + urlencode({'q': '墨菲定律'})
response2 = requests.get(url)
print(response2.text)
# The get method provides a params argument, which internally just calls urlencode
response3 = requests.get('https://list.tmall.com/search_product.htm?', params={"q": "墨菲定律"})
print(response3.text)
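Since params is just urlencode under the hood, the encoding can be verified offline with the standard library alone (no request is actually sent here):

```python
from urllib.parse import urlencode, unquote

# what requests does internally when you pass params={"q": "墨菲定律"}
query = urlencode({'q': '墨菲定律'})
url = 'https://list.tmall.com/search_product.htm?' + query
print(url)

# the Chinese is percent-encoded as UTF-8 bytes...
print(query)           # q=%E5%A2%A8%E8%8F%B2%E5%AE%9A%E5%BE%8B
# ...and unquote reverses it
print(unquote(query))  # q=墨菲定律
```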

3. GET requests with parameters -> headers

We usually need to send request headers along with the request; the headers are the key to disguising a crawler as a browser. The most commonly used ones are:

Host # address of the target site

Referer # large sites often use this header to check where the request came from
User-Agent # identifies the client; this is what makes your crawler look like a browser
Cookie # cookies do travel in the request headers, but requests has a dedicated parameter for them, so don't put them inside headers={}
# Add headers (servers inspect the request headers; without them the request may be rejected, e.g. when visiting https://www.zhihu.com/explore)
import requests
response = requests.get('https://www.zhihu.com/explore')
print(response.status_code) # 400
# customize your own headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36',
}
response = requests.get('https://www.zhihu.com/explore',
                        headers=headers)
print(response.status_code)  # 200

4. GET requests with parameters -> cookies

import uuid
import requests
url = 'http://httpbin.org/cookies'
cookies = dict(id=str(uuid.uuid4()))
res = requests.get(url, cookies=cookies)
print(res.json())

5. Bypassing the github login with a GET request

'''
# Bypass the github login with a GET request
1. Log in to github in a browser and copy the page's cookies
2. Send a GET request straight to the settings/emails page carrying those cookies, and check whether the phone number appears in the page
'''
import requests
url = 'https://github.com/settings/emails'
# cookies obtained after logging in
COOKIES = {
    'Cookie': 'has_recent_activity=1; _device_id=7461de19eff07d6573a56a066339a960; _octo=GH1.1.319526069.1558361584; user_session=QRx3XyXhwr3AHjuII-Wxb8_ierjBAwbevSrHm4Rv6ZwXEUh-; __Host-user_session_same_site=QRx3XyXhwr3AHjuII-Wxb8_ierjBAwbevSrHm4Rv6ZwXEUh-; logged_in=yes; dotcom_user=TankJam; _ga=GA1.2.962318257.1558361589; _gat=1; tz=Asia%2FShanghai; _gh_sess=NFB0ZHIxckxVZjBkaXh1YlFwenRkOVNXQnczR0FnYy9VWUN6ZEdGOE4rSzR0U2FWZlRjM0JRdURTUWV5NVF0cmdIZVBiaUlUYTBSWGxnNklERFRuQzhaOFVBVW1SZ2ZjOHMzQWwxWHdKb1F3elh4M3JCbEkyVGZiMGNZVnlmeXVud1V3NFdGbVZFR2EvL0JUT1FQQzdaSTM5V3Uzc0F4WWxRWkxURGZDOWxSNEd1WXBScEREOHJvL0VGeFRCMlRUZDZ5bFZLemRvRkhmRUI4b0MyRWtVYmxDcm03VmlQcGJoZVkyMnRvMXJEVUI4VnhuUTVWREFXVXZORWcvYnJpWVl3a2w0MnJhbGFmd2JKY2pWWkJZYXc9PS0tOVBOVlNQTkFBZkdtNTBFaGgrc2pRZz09--e89729402a6138aeeddc8a191997ec1199eeaa5c'
}
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'
}
response = requests.get(url, headers=HEADERS, cookies=COOKIES)
print('15622792660' in response.text) # True

III. POST Requests

1. Introduction

'''
GET requests (GET is HTTP's default request method):
    * no request body
    * the data is limited in size (URL length is typically capped by browsers and servers)
    * GET request data is exposed in the browser's address bar
Common GET operations:
    1. Entering a URL directly in the browser's address bar always issues a GET request
    2. Clicking a hyperlink on a page also always issues a GET request
    3. Submitting a form uses GET by default, but can be switched to POST
POST requests:
    (1) the data does not appear in the address bar
    (2) there is no upper limit on the size of the data
    (3) there is a request body
    (4) Chinese characters in the request body are URL-encoded!

!!! requests.post() is used exactly like requests.get(); what's special is that requests.post() takes a data parameter holding the request body!
'''
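Point (4) above, the URL-encoding of Chinese in a form body, can be sketched with the standard library; this is essentially what requests does to the data dict before sending it (the field values below are made up):

```python
from urllib.parse import urlencode

# what requests puts in the request body when you pass data={...} to requests.post
form = {'username': '坦克', 'password': '123'}
body = urlencode(form)
print(body)  # username=%E5%9D%A6%E5%85%8B&password=123
# the matching request header would be:
# Content-Type: application/x-www-form-urlencoded
```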

2. Sending a POST request to simulate a browser login

When analyzing a login, deliberately enter a wrong username or password in the login form and then capture the traffic. If the credentials are correct, the browser redirects right away and there is nothing left to analyze — you could search forever and never find the packet.

'''
Automatic github login via POST.
github anti-scraping hurdles:
1. The POST to /session must carry the cookies returned by the login page
2. The emails page must carry the cookies set after the /session request
'''
import requests
import re
# Step 1: request the login page to obtain the authenticity_token
login_url = 'https://github.com/login'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Referer': 'https://github.com/'
}
login_res = requests.get(login_url, headers=headers)
# print(login_res.text)
authenticity_token = re.findall('name="authenticity_token" value="(.*?)"', login_res.text, re.S)[0]
# print(authenticity_token)
login_cookies = login_res.cookies.get_dict()
# Step 2: POST to /session with the token in the request body
session_url = 'https://github.com/session'
session_headers = {
    'Referer': 'https://github.com/login',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
}
form_data = {
    "commit": "Sign in",
    "utf8": "✓",
    "authenticity_token": authenticity_token,
    "login": "tankjam",
    "password": "kermit46709394",
    'webauthn-support': "supported"
}
# Step 3: check whether the login succeeded
session_res = requests.post(
    session_url,
    data=form_data,
    cookies=login_cookies,
    headers=session_headers,
    # allow_redirects=False
)
session_cookies = session_res.cookies.get_dict()
url3 = 'https://github.com/settings/emails'
email_res = requests.get(url3, cookies=session_cookies)
print('15622792660' in email_res.text)
'''
Automatic login via POST with a Session object:
requests.session() automatically carries the cookies from all previous requests.
'''
import requests
import re
# Step 1: request the login page to obtain the authenticity_token
login_url = 'https://github.com/login'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Referer': 'https://github.com/'
}
session = requests.session()
login_res = session.get(login_url, headers=headers)
# print(login_res.text)
authenticity_token = re.findall('name="authenticity_token" value="(.*?)"', login_res.text, re.S)[0]
# print(authenticity_token)
# Step 2: POST to /session with the token in the request body
session_url = 'https://github.com/session'
session_headers = {
    'Referer': 'https://github.com/login',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
}

form_data = {
    "commit": "Sign in",
    "utf8": "✓",
    "authenticity_token": authenticity_token,
    "login": "tankjam",
    "password": "kermit46709394",
    'webauthn-support': "supported"
}
# Step 3: check whether the login succeeded
session_res = session.post(
    session_url,
    data=form_data,
    # cookies=login_cookies,  # no longer needed: the session carries them
    headers=session_headers,
    # allow_redirects=False
)
url3 = 'https://github.com/settings/emails'
email_res = session.get(url3)
print('15622792660' in email_res.text)
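A quick way to see that the session object really does keep state is to put a cookie in its jar by hand and read it back — no network needed (the cookie name and value below are invented for the illustration):

```python
import requests

# requests.Session keeps a cookie jar that is sent with every subsequent request;
# setting a cookie manually shows the jar persisting state across calls
session = requests.session()
session.cookies.set('user_session', 'abc123')
print(session.cookies.get_dict())  # {'user_session': 'abc123'}
```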

3. Notes

'''
Notes:
1. If no request header is specified, the default Content-Type is application/x-www-form-urlencoded.
2. If we set a custom Content-Type of application/json but pass the values via data, the server will not receive them.
3. If post is given the json parameter, the default Content-Type is application/json.
'''
import requests
# no header specified; default Content-Type: application/x-www-form-urlencoded
requests.post(url='xxxxxxxx',
              data={'xxx': 'yyy'})
# custom application/json header but values passed via data: the server cannot read them
requests.post(url='',
              data={'id': 9527, },
              headers={
                  'content-type': 'application/json'
              })
# with the json parameter the default Content-Type is application/json
requests.post(url='',
              json={'id': 9527, },
              )
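The difference between data= and json= boils down to how the payload is serialized and which Content-Type is sent; a standard-library sketch of the two body formats:

```python
import json
from urllib.parse import urlencode

payload = {'id': 9527}

# data={...}  ->  Content-Type: application/x-www-form-urlencoded
form_body = urlencode(payload)
print(form_body)  # id=9527

# json={...}  ->  Content-Type: application/json
json_body = json.dumps(payload)
print(json_body)  # {"id": 9527}
```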

IV. The Response Object

1. Response attributes

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36',
}
response = requests.get('https://www.github.com', headers=headers)
# response attributes
print(response.status_code)         # response status code
print(response.url)                 # final URL
print(response.text)                # body as text
print(response.content)             # body as raw bytes
print(response.headers)             # response headers
print(response.history)             # responses from the redirects that led here
print(response.cookies)             # cookies
print(response.cookies.get_dict())  # cookies as a dict
print(response.cookies.items())     # cookies as a list of (name, value) pairs
print(response.encoding)            # character encoding
print(response.elapsed)             # time elapsed for the request

2. Encoding issues


import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36',
}
# encoding issues
response = requests.get('http://www.autohome.com/news', headers=headers)
# print(response.text)
# The autohome page is encoded as gb2312, while requests falls back to ISO-8859-1; without setting gbk the Chinese is garbled
response.encoding = 'gbk'
print(response.text)
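The mojibake can be reproduced offline: encode some Chinese as gbk (as the server does) and decode it with the wrong codec versus the right one:

```python
# reproduce the mojibake: gbk-encoded Chinese decoded with the wrong codec
raw = '汽车之家'.encode('gbk')    # the bytes as the server would send them
print(raw.decode('ISO-8859-1'))  # garbage: every byte becomes a Latin-1 character
print(raw.decode('gbk'))         # 汽车之家 -- decoded correctly
```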

3. Fetching binary data

import requests
# fetch a binary stream
url = 'https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1557981645442&di=688744cc87ffd353a5720e5d942d588b&imgtype=0&src=http%3A%2F%2Fk.zol-img.com.cn%2Fsjbbs%2F7692%2Fa7691515_s.jpg'

response = requests.get(url)
print(response.content)
# write the binary stream in one go
with open('dog_baby.jpg', 'wb') as f:
    f.write(response.content)
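For large files, holding the whole body in memory via response.content is wasteful; requests offers response.iter_content(chunk_size=...) together with stream=True for chunked downloads. The chunked-write pattern itself, simulated offline with an in-memory stream standing in for the response body:

```python
import io

# a stand-in for the response body (fake bytes, chosen for the sketch)
fake_body = io.BytesIO(b'\xff\xd8' + b'x' * 10000)

total = 0
with open('chunked_demo.jpg', 'wb') as f:
    while True:
        chunk = fake_body.read(1024)  # 1 KB at a time
        if not chunk:
            break
        f.write(chunk)
        total += len(chunk)
print('wrote %d bytes in 1 KB chunks' % total)  # wrote 10002 bytes in 1 KB chunks
```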

4. Parsing JSON

# parse JSON
import requests
import json
response = requests.get('https://landing.toutiao.com/api/pc/realtime_news/')
print(response.text)  # returns JSON-formatted text
# deserialize via the json module -- works, but clumsy
new_dict = json.loads(response.text)
print(new_dict)
# the response object provides a built-in method for parsing JSON
new_dict = response.json()
print(new_dict)
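response.json() is essentially json.loads applied to the response body; the equivalence can be shown on a hand-written sample body (the structure below is only illustrative, not the real toutiao payload):

```python
import json

# what response.json() does under the hood, applied to a sample body
sample_body = '{"data": [{"title": "headline"}], "message": "success"}'
new_dict = json.loads(sample_body)
print(new_dict['message'])  # success
print(new_dict['data'][0])  # {'title': 'headline'}
```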

5. Redirection and History

By default Requests will perform location redirection for all verbs except HEAD.
We can use the history property of the Response object to track redirection.
The Response.history list contains the Response objects that were created in order to complete the request. The list is sorted from the oldest to the most recent response.
For example, GitHub redirects all HTTP requests to HTTPS:
>>> r = requests.get('http://github.com')
>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [301]>]
If you're using GET, OPTIONS, POST, PUT, PATCH or DELETE, you can disable redirection handling with the allow_redirects parameter:
>>> r = requests.get('http://github.com', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]

If you're using HEAD, you can enable redirection as well:
>>> r = requests.head('http://github.com', allow_redirects=True)
>>> r.url
'https://github.com/'
>>> r.history
[<Response [301]>]
# redirect and history
'''
Form fields for the github login POST:
commit: Sign in
utf8: ✓
authenticity_token: obtained from the login page
login: kermitjam
password: kermit46709394
'webauthn-support': 'supported'
'''
import requests
import re
# Step 1: request the login page to obtain the authenticity_token
login_url = 'https://github.com/login'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Referer': 'https://github.com/'
}
login_res = requests.get(login_url, headers=headers)
# print(login_res.text)
authenticity_token = re.findall('name="authenticity_token" value="(.*?)"', login_res.text, re.S)[0]
print(authenticity_token)
login_cookies = login_res.cookies.get_dict()
# Step 2: POST to /session with the token in the request body
session_url = 'https://github.com/session'
session_headers = {
    'Referer': 'https://github.com/login',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
}
form_data = {
    "commit": "Sign in",
    "utf8": "✓",
    "authenticity_token": authenticity_token,
    "login": "tankjam",
    "password": "kermit46709394",
    'webauthn-support': 'supported'
}
# Step 3: examine history
# Test 1
session_res = requests.post(
    session_url,
    data=form_data,
    cookies=login_cookies,
    headers=session_headers,
    # allow_redirects=False
)
print(session_res.url)              # URL after the redirect -> https://github.com/
print(session_res.history)          # a list holding the response objects from before the redirect
print(session_res.history[0].text)  # response.text from before the redirect
# Test 2
session_res = requests.post(
    session_url,
    data=form_data,
    cookies=login_cookies,
    headers=session_headers,
    allow_redirects=False
)
print(session_res.status_code)  # 302
print(session_res.url)          # redirects disabled, so this is the current URL
print(session_res.history)      # an empty list

V. Advanced Usage (for reference)

1. SSL Cert Verification

- https://www.xiaohuar.com/

# Certificate verification (most sites are https)
import requests
# For an SSL request the certificate is checked first; if it is invalid an error is raised and the program terminates
response = requests.get('https://www.xiaohuar.com')
print(response.status_code)
# Improvement 1: suppress the error, but a warning is still printed
import requests
response = requests.get('https://www.xiaohuar.com', verify=False)
# the certificate is not verified; a warning is printed and 200 is returned
print(response.status_code)
# Improvement 2: suppress both the error and the warning
import requests
import urllib3
urllib3.disable_warnings()  # silence the warning
response = requests.get('https://www.xiaohuar.com', verify=False)
print(response.status_code)
# Improvement 3: supply a certificate
# Many https sites can be accessed without a client certificate; in most cases it is optional
# zhihu, baidu and the like work either way
# Some sites make it mandatory: only targeted users who have been issued a certificate may access them
import requests
import urllib3
# urllib3.disable_warnings()  # silence the warning
response = requests.get(
    'https://www.xiaohuar.com',
    # verify=False,
    cert=('/path/server.crt', '/path/key'))
print(response.status_code)

2. Timeouts

# timeout settings
# two forms of timeout: float or tuple
# timeout=0.1        # a single float applies to both connecting and reading
# timeout=(0.1, 0.2) # 0.1 is the connect timeout, 0.2 is the read timeout
import requests
response = requests.get('https://www.baidu.com',
                        timeout=0.0001)
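The read timeout can be demonstrated without any external site: a throwaway local socket that accepts connections but never answers guarantees the timeout fires. The sketch below uses the stdlib urllib so it runs even without requests installed; requests' timeout= guards against exactly this situation:

```python
import socket
import urllib.request

# a local server socket that accepts connections but never sends a response
server = socket.socket()
server.bind(('127.0.0.1', 0))
server.listen(1)
port = server.getsockname()[1]

try:
    urllib.request.urlopen('http://127.0.0.1:%d/' % port, timeout=0.3)
    timed_out = False
except OSError:  # socket.timeout and URLError are both OSError subclasses
    timed_out = True
finally:
    server.close()

print('request timed out:', timed_out)  # request timed out: True
```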

3. Using proxies

# Official docs: http://docs.python-requests.org/en/master/user/advanced/#proxies
# Proxy setup: the request goes to the proxy first, and the proxy forwards it for you (IP bans are a common occurrence)
import requests
proxies = {
    # a proxy with username and password; the part before the @ is user:password
    # (a dict can hold only one entry per scheme, so pick one 'http' value)
    # 'http': 'http://tank:123@localhost:9527',
    'http': 'http://localhost:9527',
    'https': 'https://localhost:9527',
}
response = requests.get('https://www.12306.cn',
                        proxies=proxies)
print(response.status_code)
# SOCKS proxies are supported too; install with: pip install requests[socks]
import requests
proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
response = requests.get('https://www.12306.cn',
                        proxies=proxies)
print(response.status_code)
'''
Scraping xici's free proxy list:
1. Visit the xici free-proxy pages
2. Parse out all the proxies with the re module
3. Test each scraped proxy against an IP-testing site
4. If the test_ip function raises an exception the proxy is discarded; otherwise it is usable
5. Use the working proxies for real proxied requests

A sample row from the proxy table:
    Cn | 112.85.131.99 | 9999 | high anonymity | HTTPS | alive 6 days | verified 19-05-16 11:20
'''

import requests
import re
import time
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
}
def get_index(url):
    time.sleep(1)
    response = requests.get(url, headers=HEADERS)
    return response
def parse_index(text):
    # the pattern captures the IP and port cells of each table row
    # (reconstructed here; verify it against the live page's markup)
    ip_list = re.findall('<tr class="odd">.*?<td>(.*?)</td>.*?<td>(.*?)</td>', text, re.S)
    for ip_port in ip_list:
        ip = ':'.join(ip_port)
        yield ip
def test_ip(ip):
    print('testing ip: %s' % ip)
    try:
        proxies = {
            'https': ip
        }
        # IP-testing site
        ip_url = 'https://www.ipip.net/'
        # visit the test site through the proxy; a 200 response means the proxy works
        response = requests.get(ip_url, headers=HEADERS, proxies=proxies, timeout=1)
        if response.status_code == 200:
            return ip
    # an invalid proxy raises an exception
    except Exception as e:
        print(e)
# scrape the NBA site through the proxy
def spider_nba(good_ip):
    url = 'https://china.nba.com/'
    proxies = {
        'https': good_ip
    }
    response = requests.get(url, headers=HEADERS, proxies=proxies)
    print(response.status_code)
    print(response.text)

if __name__ == '__main__':
    base_url = 'https://www.xicidaili.com/nn/{}'
    for line in range(1, 3677):
        ip_url = base_url.format(line)
        response = get_index(ip_url)
        ip_list = parse_index(response.text)
        for ip in ip_list:
            # print(ip)
            good_ip = test_ip(ip)
            if good_ip:
                # a working proxy; start the real scrape
                spider_nba(good_ip)

4. Authentication

# authentication
'''
Some sites pop up a dialog (similar to alert) asking for a username and password; you cannot reach the HTML page until authorization succeeds.
The requests module supports several authentication schemes, including HTTP Basic Auth...
The idea is that the username and password establish the user's identity, and a token then authorizes the user.
HTTP Basic Auth:
HTTP Basic Auth is the authentication scheme introduced with HTTP/1.0. The client authenticates to each realm with a username and password; when authentication fails, the server responds to the request with a 401.
'''
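Under the hood, HTTP Basic Auth just base64-encodes "user:password" into an Authorization header; a standard-library sketch of what HTTPBasicAuth produces (the credentials are made up):

```python
import base64

# build the Authorization header the way Basic Auth requires
user, password = 'tank', 'secret'
token = base64.b64encode(('%s:%s' % (user, password)).encode()).decode()
auth_header = 'Basic ' + token
print(auth_header)  # Basic dGFuazpzZWNyZXQ=
```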
import requests
# test against the github API
url = 'https://api.github.com/user'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
}
# Test 1: without credentials the request fails with 401
response = requests.get(url, headers=HEADERS)
print(response.status_code)  # 401
print(response.text)
'''
Output:
{
"message": "Requires authentication",
"documentation_url": "https://developer.github.com/v3/users/#get-the-authenticated-user"
}
'''
# Test 2: authenticate with HTTPBasicAuth from requests.auth; on success the user info is returned
from requests.auth import HTTPBasicAuth
response = requests.get(url, headers=HEADERS, auth=HTTPBasicAuth('tankjam', 'kermit46709394'))
print(response.text)
# Test 3: the auth parameter of requests.get defaults to HTTPBasicAuth, so a plain tuple works too
response = requests.get(url, headers=HEADERS, auth=('tankjam', 'kermit46709394'))
print(response.text)
'''
Output:
{
"login": "TankJam",
"id": 38001458,
"node_id": "MDQ6VXNlcjM4MDAxNDU4",
"avatar_url": "https://avatars2.githubusercontent.com/u/38001458?v=4",
"gravatar_id": "",
"url": "https://api.github.com/users/TankJam",
"html_url": "https://github.com/TankJam",
"followers_url": "https://api.github.com/users/TankJam/followers",
"following_url": "https://api.github.com/users/TankJam/following{/other_user}",
"gists_url": "https://api.github.com/users/TankJam/gists{/gist_id}",
"starred_url": "https://api.github.com/users/TankJam/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/TankJam/subscriptions",
"organizations_url": "https://api.github.com/users/TankJam/orgs",
"repos_url": "https://api.github.com/users/TankJam/repos",
"events_url": "https://api.github.com/users/TankJam/events{/privacy}",
"received_events_url": "https://api.github.com/users/TankJam/received_events",
"type": "User",
"site_admin": false,
"name": "kermit",
"company": null,
"blog": "",
"location": null,

"email": null,
"hireable": null,
"bio": null,
"public_repos": 6,
"public_gists": 0,
"followers": 0,
"following": 0,
"created_at": "2018-04-02T09:39:33Z",
"updated_at": "2019-05-14T07:47:20Z",
"private_gists": 0,
"total_private_repos": 1,
"owned_private_repos": 1,
"disk_usage": 8183,
"collaborators": 0,
"two_factor_authentication": false,
"plan": {
"name": "free",
"space": 976562499,
"collaborators": 0,
"private_repos": 10000
}
}
'''

5. Exception handling

# exception handling
import requests
from requests.exceptions import *  # see requests.exceptions for the available exception types
try:
    r = requests.get('http://www.baidu.com', timeout=0.00001)
except ReadTimeout:
    print('===:')
except ConnectionError:  # network unreachable
    print('-----')
except Timeout:
    print('aaaaa')
except RequestException:
    print('Error')

6. Uploading files

# 6. uploading files
import requests
# upload a text file
files1 = {'file': open('user.txt', 'rb')}
response = requests.post('http://httpbin.org/post', files=files1)
print(response.status_code)  # 200
print(response.text)
# upload an image file
files2 = {'jpg': open('小狗.jpg', 'rb')}
response = requests.post('http://httpbin.org/post', files=files2)
print(response.status_code)  # 200
print(response.text)
# upload a video file
files3 = {'movie': open('love_for_GD.mp4', 'rb')}
response = requests.post('http://httpbin.org/post', files=files3)
print(response.status_code)  # 200
print(response.text)
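The files= parameter makes requests build a multipart/form-data body; a hand-rolled sketch of roughly what that body looks like (boundary, field name, and file content are chosen arbitrarily for illustration):

```python
# a minimal multipart/form-data body, like the one requests builds from files={...}
boundary = 'demo-boundary-1234'
file_name = 'user.txt'
file_bytes = b'hello requests'

body = (
    ('--%s\r\n' % boundary).encode()
    + ('Content-Disposition: form-data; name="file"; filename="%s"\r\n' % file_name).encode()
    + b'Content-Type: text/plain\r\n\r\n'
    + file_bytes + b'\r\n'
    + ('--%s--\r\n' % boundary).encode()
)
# the matching request header would be:
# Content-Type: multipart/form-data; boundary=demo-boundary-1234
print(body.decode())
```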

