1. Aquaman: analysis before scraping the comment data
Aquaman has hit theaters and the word of mouth has exploded, which for us means one more movie we can scrape and analyze. Lovely~
Here is one sample comment:
Just got out of the midnight screening. Director Wan's films have always been solid; Furious 7, Saw and The Conjuring were all great. The fights and the sound design are beyond reproach, truly stunning. All in all, DC claws one back ( ̄▽ ̄). It's better than Justice League by more than a little (my personal take). Also, Amber Heard is genuinely gorgeous; Wan casts his people well. Honestly the first movie I've seen that's this awesome, the transitions and effects are off the charts.
2. Aquaman: starting the scrape
Once again we scrape the comments from Maoyan. For this part we bring out the big guns and use Scrapy, even though in the ordinary case a plain requests call would do.
Target URL
<code>http://m.maoyan.com/mmdb/comments/movie/249342.json?_v_=yes&offset=15&startTime=2018-12-11%2009%3A58%3A43</code>
Key parameters
<code>url: http://m.maoyan.com/mmdb/comments/movie/249342.json
offset: 15
startTime: start time of the comments to fetch</code>
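The percent-encoded startTime in the URL above (%20 for the space, %3A for the colons) can be reproduced with the standard library before reaching for Scrapy. A quick sketch, using only the parameters shown above:

```python
from urllib.parse import urlencode, quote

base = "http://m.maoyan.com/mmdb/comments/movie/249342.json"
params = {"_v_": "yes", "offset": 15, "startTime": "2018-12-11 09:58:43"}

# quote_via=quote percent-encodes the space as %20 and ':' as %3A
# (the default quote_plus would turn the space into '+')
url = base + "?" + urlencode(params, quote_via=quote)
print(url)
```

This yields exactly the request URL shown above, with offset and startTime ready to be varied for paging.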
The Scrapy code for Maoyan is especially simple; I just split it across a few .py files. Haiwang.py:
<code>import scrapy
import json
from haiwang.items import HaiwangItem


class HaiwangSpider(scrapy.Spider):
    name = 'Haiwang'
    allowed_domains = ['m.maoyan.com']
    start_urls = ['http://m.maoyan.com/mmdb/comments/movie/249342.json?_v_=yes&offset=0&startTime=0']

    def parse(self, response):
        print(response.url)
        body_data = response.body_as_unicode()
        js_data = json.loads(body_data)
        item = HaiwangItem()
        for info in js_data["cmts"]:
            item["nickName"] = info["nickName"]
            item["cityName"] = info["cityName"] if "cityName" in info else ""
            item["content"] = info["content"]
            item["score"] = info["score"]
            item["startTime"] = info["startTime"]
            item["approve"] = info["approve"]
            item["reply"] = info["reply"]
            item["avatarurl"] = info["avatarurl"]
            yield item

        yield scrapy.Request(
            "http://m.maoyan.com/mmdb/comments/movie/249342.json?_v_=yes&offset=0&startTime={}".format(item["startTime"]),
            callback=self.parse)</code>
settings.py
The settings need request headers configured:
<code>DEFAULT_REQUEST_HEADERS = {
    "Referer": "http://m.maoyan.com/movie/249342/comments?_v_=yes",
    "User-Agent": "Mozilla/5.0 Chrome/63.0.3239.26 Mobile Safari/537.36",
    "X-Requested-With": "superagent"
}</code>
A few crawl-behavior settings are also needed:
<code>ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
COOKIES_ENABLED = False</code>
Enable the item pipeline:
<code>ITEM_PIPELINES = {
    'haiwang.pipelines.HaiwangPipeline': 300,
}</code>
items.py declares the data you want to collect:
<code>import scrapy


class HaiwangItem(scrapy.Item):
    nickName = scrapy.Field()
    cityName = scrapy.Field()
    content = scrapy.Field()
    score = scrapy.Field()
    startTime = scrapy.Field()
    approve = scrapy.Field()
    reply = scrapy.Field()
    avatarurl = scrapy.Field()</code>
pipelines.py saves the data, writing it to a CSV file:
<code>import os
import csv


class HaiwangPipeline(object):
    def __init__(self):
        store_file = os.path.dirname(__file__) + '/spiders/haiwang.csv'
        self.file = open(store_file, "a+", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        try:
            self.writer.writerow((
                item["nickName"],
                item["cityName"],
                item["content"],
                item["approve"],
                item["reply"],
                item["startTime"],
                item["avatarurl"],
                item["score"]
            ))
        except Exception as e:
            print(e.args)
        return item

    def close_spider(self, spider):
        self.file.close()</code>
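The writerow call in process_item can be tried in isolation against an in-memory buffer instead of the real haiwang.csv; the item below is a made-up example, not real scraped data:

```python
import csv
import io

# in-memory stand-in for the haiwang.csv file the pipeline appends to
buf = io.StringIO(newline="")
writer = csv.writer(buf)

# invented example item with the same keys the pipeline writes
item = {"nickName": "moviegoer", "cityName": "", "content": "great film",
        "approve": 12, "reply": 3, "startTime": "2018-12-11 09:58:43",
        "avatarurl": "http://example.com/a.jpg", "score": 9.5}

writer.writerow((item["nickName"], item["cityName"], item["content"],
                 item["approve"], item["reply"], item["startTime"],
                 item["avatarurl"], item["score"]))

print(buf.getvalue().strip())
```

Note the empty cityName simply becomes an empty field between two commas, so the column layout stays stable even when the source JSON omits a value.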
begin.py is a small run script:
<code>from scrapy import cmdline

cmdline.execute("scrapy crawl Haiwang".split())</code>
Done. Now just wait for the data to roll in. For more technical articles and materials, follow the WeChat official account python社区营.