Python Web-Scraping Self-Study Notes: Scraping Novels (5)



Continuing from the previous post, where the code downloaded a novel from a given txt link, this post implements searching the site by novel title, retrieving the download link, and then downloading the novel.

1 Site Analysis

The site's search page is at: https://www.555x.org/search.html

Inspecting the title-input field on the search page shows that it submits via the POST method, with the entered novel title assigned to the parameter searchkey. We can therefore call requests.post() with the dict {"searchkey": "novel title"} to fetch the search-results page, and extract the novel's URL from the returned result list.
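Before sending anything over the network, the shape of this POST request can be checked locally by preparing it without sending it. This is a minimal sketch; the URL and the searchkey field name come from the site analysis above, and the title used here is just a placeholder:

```python
import requests

# Build (but do not send) the search request, so we can inspect
# the method and the form-encoded body that would be submitted.
req = requests.Request(
    "POST",
    "https://www.555x.org/search.html",
    data={"searchkey": "小說名稱"},  # placeholder title
).prepare()

print(req.method)               # POST
print("searchkey" in req.body)  # True (title is percent-encoded in the body)
```

The prepared body is a standard `application/x-www-form-urlencoded` string, which is exactly what `requests.post(url, data=...)` sends.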

2 Coding Approach

1) Provide the novel title;

2) Search the novel site for the title and extract the novel's id number;

3) Derive the download link from the id, then download the novel.
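Step 2 relies on slicing the id out of the result URL. A small sketch with a made-up link (the "txt…​.html" path pattern is assumed from the site's result pages; the id 87527 is invented):

```python
# Hypothetical search-result link in the pattern the site returns
novel_link = "https://www.555x.org/txt87527.html"

s = novel_link.find("txt")
e = novel_link.find(".html")
novel_number = novel_link[s + 3:e]  # characters between "txt" and ".html"

print(novel_number)  # 87527
```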

3 Code Implementation

The source code is as follows:

<code># crawl_v1.4
# Download a novel as a txt file
import requests
from bs4 import BeautifulSoup
import time
import proxy_ip

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive"}

# Fetch the site's search results for a novel title
def get_search(novel, proxy):
    try:
        r = requests.post("https://www.555x.org/search.html",
                          data={"searchkey": novel},
                          headers=HEADERS, proxies=proxy)
        r.raise_for_status()
    except requests.RequestException:
        proxy = proxy_ip.get_random_ip()
        print("Switching proxy IP")
        r = requests.post("https://www.555x.org/search.html",
                          data={"searchkey": novel},
                          headers=HEADERS, proxies=proxy)
    soup = BeautifulSoup(r.text, "html.parser")
    qq_g = soup.find_all("li", "qq_g")
    link = ""
    for i in qq_g:
        s = i.text.find("》")
        # Extract the full title from each search result and compare it
        # with the input name; on a match, save the link and stop the
        # loop. If nothing matches, link stays empty.
        if i.text[1:s] == novel:
            link = i.a.get("href")
            break
    return link

# Download the novel by its id number
def novel_download(novel, n, proxy):
    url = "https://www.555x.org/home/down/txt/id/" + n
    try:
        r = requests.get(url, headers=HEADERS, proxies=proxy)
        r.raise_for_status()
    except requests.RequestException:
        proxy = proxy_ip.get_random_ip()
        print("Switching proxy IP")
        r = requests.get(url, headers=HEADERS, proxies=proxy)
    # Save the novel locally as raw bytes, preserving the site's encoding
    with open(novel + ".txt", "wb") as f:
        f.write(r.content)

if __name__ == "__main__":
    start_time = time.time()
    novel = input("Enter the novel title: ")
    proxy = proxy_ip.get_random_ip()
    novel_link = get_search(novel, proxy)  # search the site for the novel
    if novel_link == "":
        print("The site has no such novel")
    else:
        s = novel_link.find("txt")
        e = novel_link.find(".html")
        novel_number = novel_link[s + 3:e]  # extract the novel id
        novel_download(novel, novel_number, proxy)  # download the novel
    # Report how long the run took
    end_time = time.time()
    print("Elapsed time: " + str(round(end_time - start_time)) + "s")</code>

Running result:

4 Related Learning Points

1) The input() function;

2) Quickly creating a function in PyCharm: select the name of the function to create and press Alt+Enter;

3) requests.post() requests.

5 Closing Remarks

This post's code implements downloading a novel from just its title.

The coding process and source code are shared here for your reference. If anything is wrong, or you have better suggestions, please point them out; I would be most grateful!