Python Crawler Self-Study Notes: Scraping Novels (Part 5)



Continuing from the previous post: the earlier code downloaded a novel given its txt link. This post implements searching the site by novel title, retrieving the download link, and then downloading the novel.

1 Website Analysis

The site's search page is at: https://www.555x.org/search.html

Inspecting the title input field on the search page shows that the form is submitted with the POST method, and the entered title is assigned to the parameter searchkey. We can therefore call requests.post() and send the dictionary {"searchkey": "novel title"} to obtain the search results page, then extract the novel's URL from the returned result list.
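As a quick illustration of that request, here is a minimal sketch without the headers and proxy handling used in the full code; the ("li", "qq_g") selector is taken from the full code in Section 3 and assumes the site's current markup:

<code>
import requests
from bs4 import BeautifulSoup

# POST the search form; the novel title goes in the "searchkey" field
resp = requests.post("https://www.555x.org/search.html",
                     data={"searchkey": "novel title"})
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# Each search hit is an <li class="qq_g"> whose <a href> points to the novel's page
for li in soup.find_all("li", "qq_g"):
    print(li.text, li.a.get("href"))
</code>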

2 Coding Approach

1) Take a novel title as input;

2) Search the site for the novel and extract its ID;

3) Build the download link from the ID and download the novel (a sketch of this extraction follows the list).
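For steps 2) and 3), the ID sits in the search-result URL itself. A small sketch of the extraction, assuming a result link of the form .../txt<ID>.html as handled in the full code (the example URL below is made up for illustration):

<code>
# Sketch: derive the download URL from a search-result link
def download_url(novel_link):
    s = novel_link.find("txt")
    e = novel_link.find(".html")
    novel_number = novel_link[s + 3:e]  # the digits between "txt" and ".html"
    return "https://www.555x.org/home/down/txt/id/" + novel_number

print(download_url("https://www.555x.org/txt12345.html"))
# -> https://www.555x.org/home/down/txt/id/12345
</code>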

3 Code Implementation

The source code is as follows:

<code>
# crawl_v1.4
# Download a novel as a txt file
import requests
from bs4 import BeautifulSoup
import time
import proxy_ip


# Search the site and return the page link for an exact title match
def get_search(novel, proxy):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive"}
    try:
        r = requests.post("https://www.555x.org/search.html", {"searchkey": novel},
                          headers=headers, proxies=proxy)
        r.raise_for_status()
    except Exception:
        proxy = proxy_ip.get_random_ip()
        print("Switching proxy IP")
        r = requests.post("https://www.555x.org/search.html", {"searchkey": novel},
                          headers=headers, proxies=proxy)
    soup = BeautifulSoup(r.text, "html.parser")
    qq_g = soup.find_all("li", "qq_g")
    link = ""
    for i in qq_g:
        s = i.text.find("》")
        # Extract the full title from each search result and compare it with
        # the input title: on a match, store the link and stop the loop;
        # otherwise link stays empty
        if i.text[1:s] == novel:
            link = i.a.get("href")
            break
    return link


# Download the novel by its ID
def novel_download(novel, n, proxy):
    l = "https://www.555x.org/home/down/txt/id/" + n
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive"}
    try:
        r = requests.get(l, headers=headers, proxies=proxy)
        r.raise_for_status()
    except Exception:
        proxy = proxy_ip.get_random_ip()
        print("Switching proxy IP")
        r = requests.get(l, headers=headers, proxies=proxy)
    # Save the novel locally; ISO-8859-1 matches how requests decoded the
    # response, so the original bytes are written back unchanged
    with open(novel + ".txt", "w", encoding="ISO-8859-1") as f:
        f.write(r.text)


if __name__ == "__main__":
    start_time = time.time()
    novel = input("Enter a novel title: ")
    proxy = proxy_ip.get_random_ip()
    novel_link = get_search(novel, proxy)  # search the site for the title
    if novel_link == "":
        print("This novel is not on the site")
    else:
        s = novel_link.find("txt")
        e = novel_link.find(".html")
        novel_number = novel_link[s + 3:e]  # extract the novel's ID
        novel_download(novel, novel_number, proxy)  # download the novel
    # Total running time
    end_time = time.time()
    print("Running time: " + str(round(end_time - start_time)) + "s")
</code>
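One note on the file encoding: when a text response declares no charset, requests typically falls back to decoding it as ISO-8859-1, so writing r.text back out with encoding="ISO-8859-1" round-trips the original bytes. An arguably cleaner equivalent (my suggestion, not part of the original code) is to skip the decode/encode round trip and write the raw bytes:

<code>
# Equivalent save step, writing the response body without decoding it
with open(novel + ".txt", "wb") as f:  # "wb" writes raw bytes
    f.write(r.content)                 # r.content is the undecoded response body
</code>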

Run result: (screenshot omitted)

4 Related Learning Points

1) The input() function;

2) Quickly creating a function in PyCharm: select the call to the function you want to create and press Alt+Enter;

3) The requests.post() request (see the example after this list).
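On point 3), the source code passes the form data positionally. The two calls below are equivalent, because the second positional parameter of requests.post() is data, the form body of the POST request:

<code>
import requests

# Equivalent calls: the second positional argument of requests.post() is `data`
r1 = requests.post("https://www.555x.org/search.html", {"searchkey": "some title"})
r2 = requests.post("https://www.555x.org/search.html", data={"searchkey": "some title"})
</code>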

5 Closing Remarks

This post's code implements downloading a novel given only its title.

I'm sharing the coding process and source code here for reference. If you spot mistakes or have better suggestions, please point them out; I'd be very grateful!