Uses the BeautifulSoup module
Uses regular expressions
Uses multiprocessing to crawl the letter pages in parallel
Usage
Install BeautifulSoup before running (the script also needs requests and the lxml parser).
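A typical way to install the dependencies (assuming pip; these are the standard PyPI names for the imports used in the script):
<code>pip install requests beautifulsoup4 lxml/<code>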
After running, the script creates a txt file in the current directory; each line is one record in JSON format, as shown below:
{"branch_first_letter": "S", "branch_name": "萨博", "branch_id": "64", "producer": "萨博", "producer_id": "", "car_series": "Saab 900", "car_series_id": "s2630", "car_price": "暂无报价"}
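Because each line is a self-contained JSON object, the output can be loaded back line by line (a minimal sketch; cars.txt is the filename used by write_file in the code below, and the file must already exist):
<code>import json

# Read the scraped records back, one JSON object per line.
with open('cars.txt', encoding='utf-8') as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records), 'records')
print(records[0]['car_series'])/<code>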
Source code
<code>import json
import re
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException


def get_one_page(url):
    """Fetch a page and return its HTML, or None on failure.

    :param url: page URL
    :return: HTML text or None
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) '
                      'Gecko/20100101 Firefox/68.0'
    }
    try:
        response = requests.get(url, headers=headers)
        print(response.status_code)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html, first_letter):
    """Parse one page; a generator yielding one dict per car series.

    :param html: page HTML
    :param first_letter: first letter of the brand names on this page
    :return: iterable of dicts
    """
    soup = BeautifulSoup(html, 'lxml')
    info = {
        'branch_first_letter': '',
        'branch_name': '',
        'branch_id': '',
        'producer': '',
        'producer_id': '',
        'car_series': '',
        'car_series_id': '',
        'car_price': ''
    }
    # Level 1: each <dl> is a brand.
    branches = soup.find_all('dl')
    for branch in branches:
        info['branch_name'] = branch.dt.div.a.string.strip()
        info['branch_id'] = branch['id']
        info['branch_first_letter'] = first_letter
        print('Crawling... brand:', info['branch_name'])
        block = branch.find_all('dd')
        sub_soup = BeautifulSoup(str(block), 'lxml')
        # Level 2: each div.h3-tit inside the brand is a producer.
        producers = sub_soup.find_all('div', attrs={'class': 'h3-tit'})
        for producer in producers:
            info['producer'] = producer.a.get_text().strip()
            info['producer_id'] = ''
            print('Crawling... producer:', info['producer'])
            # Level 3: the <ul> following the producer lists its car series.
            cars = producer.find_next('ul')
            for car in cars.find_all('li', attrs={'id': True}):
                info['car_series_id'] = car['id']
                info['car_series'] = car.h4.a.get_text().strip()
                # Price links carry a class attribute but no data-value.
                price = car.find_all('a', attrs={'class': True, 'data-value': False})
                if price and re.match('.*?万.*?', price[0].get_text(), re.S):
                    info['car_price'] = price[0].get_text().strip()
                else:
                    # '暂无报价' means "no quote available yet".
                    info['car_price'] = '暂无报价'
                yield info


def write_file(content):
    """Append one scraped record to cars.txt as a JSON line.

    :param content: dict to save
    :return: None
    """
    with open('cars.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(first_letter):
    """Crawl and save all brands starting with the given letter.

    :param first_letter: 'A'..'Z'
    :return: None
    """
    html = get_one_page('https://www.autohome.com.cn/grade/carhtml/'
                        + first_letter + '.html')
    soup = BeautifulSoup(html, 'lxml')
    html = soup.prettify()
    for item in parse_one_page(html, first_letter):
        write_file(item)


if __name__ == '__main__':
    # Crawl the 26 letter pages in parallel with a process pool.
    with Pool() as pool:
        pool.map(main, [chr(i + ord('A')) for i in range(26)])/<code>
You might ask: why does scraping such simple data need three nested loops? The three levels mirror the page's own hierarchy (brand → producer → car series), and walking them in order is what keeps each record's fields correctly associated with their parent levels.
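To make that hierarchy concrete, here is a minimal sketch of the markup shape the three loops walk. The HTML below is a mock inferred from the selectors in parse_one_page; the real Autohome markup is much richer:
<code>from bs4 import BeautifulSoup

# Mock of the page structure: <dl> = brand, div.h3-tit = producer,
# the following <ul> = that producer's car series.
html = '''
<dl id="64">
  <dt><div><a>萨博</a></div></dt>
  <dd>
    <div class="h3-tit"><a>萨博</a></div>
    <ul>
      <li id="s2630"><h4><a>Saab 900</a></h4></li>
    </ul>
  </dd>
</dl>
'''
soup = BeautifulSoup(html, 'lxml')
for branch in soup.find_all('dl'):                                       # level 1: brand
    print(branch['id'], branch.dt.div.a.string.strip())
    for producer in branch.find_all('div', attrs={'class': 'h3-tit'}):   # level 2: producer
        print(' ', producer.a.get_text().strip())
        for car in producer.find_next('ul').find_all('li', attrs={'id': True}):  # level 3: series
            print('   ', car['id'], car.h4.a.get_text().strip())/<code>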
One thing I ran into while writing the code: in BeautifulSoup's find_all() method, if you only need to require that an attribute exists, without specifying its concrete value, you can write it like this:
<code>car.find_all('a', attrs={'class': True, 'data-value': False})/<code>
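A self-contained demonstration of this True/False attribute filter (the tags and attribute values here are made up for illustration):
<code>from bs4 import BeautifulSoup

html = ('<div>'
        '<a class="price" href="#">12.88万</a>'
        '<a class="btn" data-value="1" href="#">询价</a>'
        '<a href="#">plain link</a>'
        '</div>')
soup = BeautifulSoup(html, 'lxml')

# Match <a> tags that HAVE a class attribute but LACK data-value.
matches = soup.find_all('a', attrs={'class': True, 'data-value': False})
print(matches)  # [<a class="price" href="#">12.88万</a>]/<code>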