Tool 003 - Scraping Xiaohuar Photos with Python Scrapy

Python Tool 101 - Tool 003 - Scraping Image Data with Python Scrapy

Background:

Yesterday we scraped text data, which wasn't all that satisfying: only words, no pictures! So today let's scrape some image data. This time we'll pick the Xiaohuar site and grab photos of the campus belles.

Note that I split the spider and the image-download step into two separate scripts. We could merge them into one, but that would clearly hurt efficiency: downloading images is far slower than scraping text.

Problem:

Scrape image data with Python Scrapy.

Solution:

  • Use the Scrapy framework for the spider
  • Use MongoDB as the database
  • Scrape the photo URLs from the Xiaohuar site
  • Read the records back from MongoDB
  • Download the photos

Hands-on:

Step 1:

# Install Scrapy
python -m pip install scrapy
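
The code later in this post imports from xiaohuar.items, so it assumes a Scrapy project named xiaohuar already exists. If it does not, it can be generated first (the project name and layout here are my assumption; the original post does not show this step):

# Create the Scrapy project (items.py, pipelines.py and spiders/ end up under xiaohuar/xiaohuar/)
scrapy startproject xiaohuar
cd xiaohuar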

Step 2:

# Install MongoDB
wget http://172.16.1.150/mongo/mongodb-org-3.4.repo && \
yum makecache && \
yum -y install mongodb-org
systemctl start mongod.service
systemctl enable mongod.service
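
The item pipeline and the download script below both import pymongo to talk to MongoDB, so the driver has to be installed as well (this step is not shown in the original post):

# Install the MongoDB driver for Python
python -m pip install pymongo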

Step 3:

# Configure items
vi items.py
```
import scrapy

class XiaohuarItem(scrapy.Item):
    # Girl's name
    girls_name = scrapy.Field()
    # School name
    school_name = scrapy.Field()
    # Image download URL
    download_link = scrapy.Field()
```
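
Scrapy items behave like dictionaries with a fixed set of keys, which is what lets the pipeline call dict(item) later. A quick sketch of how one is filled in (the field values here are just placeholders):

```
from xiaohuar.items import XiaohuarItem

# Fields are assigned like dict keys; an unknown key would raise a KeyError
item = XiaohuarItem()
item['girls_name'] = 'some name'
item['school_name'] = 'some school'
item['download_link'] = '/d/file/example.jpg'
print(dict(item))
```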

# Configure pipelines
vi pipelines.py
```
from pymongo import MongoClient

class TutorialPipeline(object):
    def __init__(self):
        # Connect to MongoDB and write into the Img collection of the xiaohuar database
        conn = MongoClient('172.10.2.105', 27017)
        db = conn.xiaohuar
        self.post = db.Img

    def process_item(self, item, spider):
        # Store every scraped item as a MongoDB document
        imgInfo = dict(item)
        self.post.insert_one(imgInfo)
        return item
```
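
For Scrapy to actually route items through this pipeline, it also has to be registered in the project settings. A minimal sketch, assuming the default settings.py created by scrapy startproject xiaohuar (the original post does not show this file):

# Enable the pipeline
vi settings.py
```
# Route scraped items through TutorialPipeline (lower number = earlier in the pipeline order)
ITEM_PIPELINES = {
    'xiaohuar.pipelines.TutorialPipeline': 300,
}
```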


# Configure the spider
vi spiders/xiaohuar_spider.py
```
from scrapy.spiders import CrawlSpider
from xiaohuar.items import XiaohuarItem
import scrapy

class XiaohuarSpider(CrawlSpider):
    name = "xiaohuar"
    offset = 0
    url = "http://www.xiaohuar.com/list-1-"
    tail = ".html"
    start_urls = (
        url + str(offset) + tail,
    )

    def parse(self, response):
        images = response.xpath('//div[@class="img"]')
        print('*' * 80)
        print(len(images))
        for girls_photos in images:
            item = XiaohuarItem()
            # Girl's name
            item['girls_name'] = girls_photos.xpath('.//span[@class="price"]/text()').extract()[0]
            # School name
            item['school_name'] = girls_photos.xpath('.//div[@class="btns"]/a/text()').extract()[0]
            # Image URL
            item['download_link'] = girls_photos.xpath('.//a/img/@src').extract()[0]
            yield item

        # Keep paging through the listing until page 44
        if self.offset < 44:
            self.offset += 1
            yield scrapy.Request(self.url + str(self.offset) + self.tail, callback=self.parse)
```
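
Before running the full crawl, the XPath selectors above can be sanity-checked interactively with Scrapy's shell. A minimal sketch, assuming the listing page still has the structure it had when this post was written:

```
# Start an interactive session first:  scrapy shell "http://www.xiaohuar.com/list-1-0.html"
# Then evaluate the same expressions the spider uses:
len(response.xpath('//div[@class="img"]'))                        # photo cards found on the page
response.xpath('//div[@class="img"]//a/img/@src').extract()[:3]   # first few image links
```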

Step 4:

# Start the spider and scrape the image data
scrapy crawl xiaohuar

Step 5:

Now check the data in the database.
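
A quick way to check the results from Python with pymongo, assuming the same host and collection the pipeline writes to:

```
from pymongo import MongoClient

# Connect to the same MongoDB instance the pipeline uses
conn = MongoClient('172.10.2.105', 27017)
db = conn.xiaohuar

print(db.Img.count_documents({}))   # total number of scraped records (use .count() on very old pymongo)
print(db.Img.find_one())            # one sample document
```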


Step 6:

Now let's write the image-download script, download_img.py:

#!/usr/bin/env python3
#coding=utf-8
import os
import urllib.request
from pymongo import MongoClient

def save_img(girls_name, school_name, download_link, file_path):
    # Save the image into file_path (created if it does not exist yet)
    try:
        if not os.path.exists(file_path):
            print('Folder', file_path, 'does not exist, creating it')
            os.makedirs(file_path)
        check = download_link.split('/')
        # The stored link may be relative, protocol-less or absolute; normalise it before downloading
        if check[0] == "":
            file_suffix = download_link.split('.')[-1]
            file_url = "http://www.xiaohuar.com" + download_link
        elif "www" in check[0]:
            file_suffix = download_link.split('.')[-1]
            file_url = "http://" + download_link
        elif "http" in check[0]:
            file_suffix = download_link.split('.')[-1]
            file_url = download_link
        else:
            print('Unrecognised link format, skipping:', download_link)
            return
        file_name = file_path + girls_name + "-" + school_name + "." + file_suffix
        urllib.request.urlretrieve(file_url, file_name)
    except IOError as e:
        print('File operation failed', e)
    except Exception as e:
        print('Error:', e)

file_path = "/root/xiaohuar_images/"
# Read the records back from MongoDB
conn = MongoClient('172.10.2.105', 27017)
db = conn.xiaohuar

for record in db.Img.find():
    download_link = record['download_link']
    girls_name = record['girls_name']
    school_name = record['school_name']
    save_img(girls_name, school_name, download_link, file_path)

Step 7:

python download_img.py
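
A small sanity check after the script finishes, assuming the same output folder it writes to:

```
import os

# Count how many photos ended up in the download folder
print(len(os.listdir('/root/xiaohuar_images/')))
```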

Step 8:

Finally, I built a display page so the end result can be browsed comfortably.
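
The display page itself is not included in the post, so here is only a rough sketch of one way to build something similar (entirely an assumption, not the author's actual interface): generate a simple index.html over the downloaded photos and serve the folder with Python's built-in HTTP server.

```
import os

image_dir = "/root/xiaohuar_images"   # the folder download_img.py writes to
files = sorted(f for f in os.listdir(image_dir)
               if f.lower().endswith(('.jpg', '.png', '.gif')))

# One heading plus one <img> tag per downloaded photo
rows = ['<h3>{0}</h3><img src="{0}" width="300">'.format(name) for name in files]
with open(os.path.join(image_dir, 'index.html'), 'w') as fh:
    fh.write('<html><body>' + '\n'.join(rows) + '</body></html>')

# Serve it afterwards, e.g.:  cd /root/xiaohuar_images && python -m http.server 8000
```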
