A Programmer's Guide to Scraping Autohome Data with Ease

Uses the BeautifulSoup module

Uses regular expressions

Uses multiprocessing to parallelize crawling

Usage Notes

Install BeautifulSoup before running. The script also needs requests and lxml.
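
For reference, the three third-party packages the script imports can be installed with pip (assuming the usual PyPI package names):

<code>
# install the dependencies used by the script below
pip install requests beautifulsoup4 lxml
</code>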

After the program runs, it generates a txt file in the current directory whose content is JSON, one record per line, as shown below:

{"branch_first_letter": "S", "branch_name": "萨博", "branch_id": "64", "producer": "萨博", "producer_id": "", "car_series": "Saab 900", "car_series_id": "s2630", "car_price": "暂无报价"}
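
Since each line is an independent JSON object, the file can be read back line by line. A minimal sketch, assuming the cars.txt produced by the code below:

<code>
import json

# Read the JSON-lines output produced by the crawler.
with open('cars.txt', encoding='utf-8') as f:
    cars = [json.loads(line) for line in f if line.strip()]

print(len(cars), 'records loaded')
print(cars[0]['car_series'], cars[0]['car_price'])
</code>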

Source Code

<code>
import json
from multiprocessing import Pool
import requests
from requests.exceptions import RequestException
import re
from bs4 import BeautifulSoup


def get_one_page(url):
    """Request a page and return its HTML text, or None on failure.
    :param url:
    :return:
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) '
                      'Gecko/20100101 Firefox/68.0'
    }
    try:
        response = requests.get(url, headers=headers)
        print(response.status_code)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html, first_letter):
    """Parse one page; a generator yielding one record per car series.
    :param html:
    :param first_letter:
    :return: iterable
    """
    soup = BeautifulSoup(html, 'lxml')
    info = {
        'branch_first_letter': '',
        'branch_name': '',
        'branch_id': '',
        'producer': '',
        'producer_id': '',
        'car_series': '',
        'car_series_id': '',
        'car_price': ''
    }
    # Level 1: each <dl> is a brand (branch).
    branches = soup.find_all('dl')
    for branch in branches:
        info['branch_name'] = branch.dt.div.a.string.strip()
        info['branch_id'] = branch['id']
        info['branch_first_letter'] = first_letter
        print('Crawling... brand:', info['branch_name'])
        # Level 2: the <dd> blocks under a brand hold its manufacturers.
        block = branch.find_all('dd')
        dd_soup = BeautifulSoup(str(block), 'lxml')
        producers = dd_soup.find_all('div', attrs={'class': 'h3-tit'})
        for producer in producers:
            info['producer'] = producer.a.get_text().strip()
            info['producer_id'] = ''
            print('Crawling... manufacturer:', info['producer'])
            # Level 3: the <ul> after each manufacturer lists its car series.
            cars = producer.find_next('ul')
            for car in cars.find_all('li', attrs={'id': True}):
                info['car_series_id'] = car['id']
                info['car_series'] = car.h4.a.get_text().strip()
                # Match <a> tags that have a class but no data-value attribute.
                price = car.find_all('a', attrs={'class': True, 'data-value': False})
                if price:
                    print(price[0].get_text())
                    # A real quote contains '万' (the 10,000-CNY unit).
                    if re.match('.*?万.*?', price[0].get_text(), re.S):
                        info['car_price'] = price[0].get_text().strip()
                    else:
                        info['car_price'] = '暂无报价'  # "no price quoted yet"
                # The same dict object is reused across iterations; write_file
                # serializes it immediately, so each record is saved before it mutates.
                yield info


def write_file(content):
    """Append one scraped record to cars.txt as a JSON line.
    :param content:
    :return: None
    """
    with open('cars.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(first_letter):
    """Crawl and save all brands under one first letter.
    :param first_letter:
    :return: None
    """
    html = get_one_page('https://www.autohome.com.cn/grade/carhtml/'
                        + first_letter + '.html')
    soup = BeautifulSoup(html, 'lxml')
    html = soup.prettify()
    for item in parse_one_page(html, first_letter):
        write_file(item)


if __name__ == '__main__':
    for letter in [chr(i + ord('A')) for i in range(26)]:
        main(letter)
</code>
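
Note that the listing imports multiprocessing.Pool but the __main__ block still crawls the 26 letters sequentially. A minimal sketch of what the parallel version could look like (my own addition, not part of the original post):

<code>
from multiprocessing import Pool

if __name__ == '__main__':
    # Map the 26 first letters across a pool of worker processes.
    pool = Pool(processes=4)
    pool.map(main, [chr(i + ord('A')) for i in range(26)])
    pool.close()
    pool.join()
</code>

One caveat: all workers would append to the same cars.txt; to be safe, each worker could write to its own file (e.g., one file per letter).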

You may ask: why use three nested loops to scrape such simple data? I used three loops mainly to preserve the associations and hierarchy among the data (brand → manufacturer → car series); this is what keeps the hierarchical relationships between records intact.
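
One caveat with this generator design, my own observation rather than something from the original post: parse_one_page yields the same info dict object on every iteration. Because write_file serializes each record immediately, the output is correct, but collecting the generator into a list would leave every element aliasing one dict that holds only the last record. A self-contained demonstration, with yield dict(info) as the fix:

<code>
def gen():
    info = {'n': 0}
    for i in range(3):
        info['n'] = i
        yield info          # same dict object every time

print(list(gen()))          # [{'n': 2}, {'n': 2}, {'n': 2}] -- all aliases

def gen_copy():
    info = {'n': 0}
    for i in range(3):
        info['n'] = i
        yield dict(info)    # yield an independent snapshot

print(list(gen_copy()))     # [{'n': 0}, {'n': 1}, {'n': 2}]
</code>

With that one-line change, list(parse_one_page(html, 'A')) would return distinct records.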

While writing the code I ran into this case: in BeautifulSoup, when find_all() only needs to check whether an attribute exists, without specifying a concrete value, it can be written like this:

<code>
car.find_all('a', attrs={'class': True, 'data-value': False})
</code>
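
In attrs, True matches tags that carry the attribute regardless of its value, while False matches tags that lack it entirely. A minimal self-contained check (hypothetical markup, not from Autohome):

<code>
from bs4 import BeautifulSoup

html = '''
<a class="red" data-value="1">with data-value</a>
<a class="blue">no data-value</a>
<a>no attributes at all</a>
'''
soup = BeautifulSoup(html, 'lxml')

# Keep <a> tags that have a class but no data-value attribute.
links = soup.find_all('a', attrs={'class': True, 'data-value': False})
print([a.get_text() for a in links])  # ['no data-value']
</code>

Tags missing the class attribute, or carrying data-value, are filtered out.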
