Scraping Movie Short Reviews from a Website with Python: Web Crawlers

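The previous article showed how to use the find_all method to scrape movie rankings from the web. This one introduces an alternative based on the select function.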

Method 2: Using the select function

# x1 = soup.find_all("li")

x1 = soup.select("ol li")  # descendant selector: each nesting level of tags is separated by a space
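To see what the space-separated selector changes, here is a minimal sketch on a made-up HTML snippet. The markup and the names `html`, `all_items`, and `ranked_items` are illustrative assumptions, not taken from the real page. `find_all("li")` returns every `li` in the document, while `select("ol li")` returns only those nested inside an `ol`.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page is different.
html = """
<ul><li>Home</li><li>About</li></ul>
<ol><li>Movie A</li><li>Movie B</li></ol>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <li> in the document, including the navigation items
all_items = [li.text for li in soup.find_all("li")]

# Only the <li> elements that sit somewhere inside an <ol>
ranked_items = [li.text for li in soup.select("ol li")]

print(all_items)
print(ranked_items)
```

On a page that mixes navigation lists with the ranking list, the narrower selector saves a filtering step afterwards.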

1.5 Scraping short movie reviews

# Loop over the pages of short reviews for movie ID 32659890

import requests as rt
from bs4 import BeautifulSoup as bs

headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
           'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
           }

# Collect the short reviews page by page
k = range(0, 1200, 20)  # start offsets, 20 reviews per page
my_txt = ""
for start in k:
    # my_url = my_list1[0] + r"comments?start=" + str(start) + r"&limit=20&sort=new_score&status=P"
    my_url = (r"https://movie.douban.com/subject/32659890/"  # movie ID 32659890
              + r"comments?start=" + str(start)
              + r"&limit=20&sort=new_score&status=P")
    # print(my_url)
    my_data = rt.get(my_url, headers=headers, timeout=30)
    my_data.encoding = "utf-8"  # avoid garbled Chinese text

    soup = bs(my_data.text, "html.parser")
    # print(soup)
    x1 = soup.find_all("span", class_="short")
    for item in x1:
        # print(item.text)
        my_txt = my_txt + item.text
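The paging in the loop above works by changing only the start parameter in steps of 20. As a minimal sketch, the same URLs can be assembled with the standard library's `urlencode` instead of string concatenation. The helper name `comments_url` and the constant `DOUBAN_BASE` are hypothetical names introduced here for illustration.

```python
from urllib.parse import urlencode

# Same movie ID as in the article; the constant name is an assumption.
DOUBAN_BASE = "https://movie.douban.com/subject/32659890/comments"


def comments_url(start, limit=20):
    """Build the URL of one page of short reviews (hypothetical helper)."""
    query = urlencode({
        "start": start,        # offset of the first review on the page
        "limit": limit,        # reviews per page
        "sort": "new_score",
        "status": "P",
    })
    return DOUBAN_BASE + "?" + query


# First three pages only, to keep the demonstration short
douban_urls = [comments_url(start) for start in range(0, 60, 20)]
for u in douban_urls:
    print(u)
```

Building the query from a dictionary keeps each parameter visible in one place, which helps when the limit or sort order needs to change.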
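# Word frequency statistics
import jieba

x3 = jieba.lcut(my_txt)  # the review text split into a list of words

x4 = dict()
# Words to ignore
y1 = ["\r\n", "一部"]
for i in x3:
    if i not in y1 and len(i) > 1:
        x4[i] = x4.get(i, 0) + 1  # default of 0 so the first occurrence counts as 1

# Sort by count, highest first
res = sorted(x4.items(), key=lambda d: d[1], reverse=True)
# print(type(res), len(res))
for i in range(0, min(10, len(res))):  # top 10 words by frequency
    print(res[i])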
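The counting and sorting steps can also be expressed with `collections.Counter` from the standard library. This is a sketch rather than the article's method: the `tokens` list is a made-up stand-in for the output of `jieba.lcut`, and the names `ignored`, `counts`, and `top` are assumptions.

```python
from collections import Counter

# Hypothetical token list standing in for the output of jieba.lcut
tokens = ["好看", "劇情", "好看", "一部", "的", "演員", "劇情", "好看"]

# Words to skip, in addition to single-character tokens
ignored = {"\r\n", "一部"}

# Count every token that is longer than one character and not ignored
counts = Counter(t for t in tokens if t not in ignored and len(t) > 1)

# most_common returns (word, count) pairs sorted by count, highest first
top = counts.most_common(2)
print(top)
```

`most_common(10)` would give the same top-10 view as the manual sort, and it returns fewer items without error when the text contains fewer than ten distinct words.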
# Word cloud
from wordcloud import WordCloud
# Image handling
import cv2  # pip install opencv-python

mask = cv2.imread("000.jpg")  # read the background image used as the mask

my_txt2 = " ".join(x3)
excludes = {"\r\n", "一部", "電影", "片子", "沒有", "這部", "一個"}
wordcloud = WordCloud(background_color="white",
                      width=800,
                      height=600,
                      font_path="msyh.ttc",  # a font with Chinese glyphs (Microsoft YaHei)
                      max_words=200,
                      max_font_size=200,
                      stopwords=excludes,
                      mask=mask
                      ).generate(my_txt2)
wordcloud.to_file('111.jpg')

# Display the image
from PIL import Image
img = Image.open('111.jpg')
img.show()


Figure: the displayed word cloud image.

1.6 Scraping the Mtime rankings

import requests
from bs4 import BeautifulSoup as bs

x2 = range(1, 11)
for my_n in x2:
    if my_n == 1:
        my_url = r"http://www.mtime.com/top/movie/top100/"
    else:
        my_url = r"http://www.mtime.com/top/movie/top100/index-" + str(my_n) + r".html"
    print(my_url)
    my_data = requests.get(my_url)
    my_data.encoding = "utf-8"  # avoid garbled Chinese text
    soup = bs(my_data.text, "html.parser")
    x1 = soup.find_all('li')
    for i in x1:
        if i.find("div", class_="number") is not None:
            print(i.find("em").text, end=" ")
            try:
                print(i.find("a", class_="c_fff").text, end=" ")
            except AttributeError:  # no such link, so fall back to the other class
                print(i.find("a", class_="c_blue").text, end=" ")
            print(i.find("span", class_="total").text, end="")
            print(i.find("span", class_="total2").text)
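The only irregularity in the ranking pages is that page 1 has no index suffix, while pages 2 to 10 follow the index-N.html pattern. A minimal sketch that isolates this rule in one function is shown here. The names `page_url` and `MTIME_BASE` are hypothetical, and the code only builds the URLs without fetching anything.

```python
# Base address taken from the article; the constant name is an assumption.
MTIME_BASE = "http://www.mtime.com/top/movie/top100/"


def page_url(n):
    """Return the URL of ranking page n, counted from 1 (hypothetical helper)."""
    if n == 1:
        # The first page is the bare directory URL
        return MTIME_BASE
    # Later pages append index-<n>.html
    return MTIME_BASE + "index-" + str(n) + ".html"


mtime_urls = [page_url(n) for n in range(1, 11)]
print(mtime_urls[0])
print(mtime_urls[1])
```

Keeping the URL rule separate from the request loop makes it possible to check the generated addresses before sending any traffic to the site.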

