Tool 003 - Scraping Xiaohuar Photos with Python Scrapy

Python Tool 101 - Tool 003 - Scraping Image Data with Python Scrapy

Background:

Yesterday we scraped text data, which wasn't all that satisfying: only words, no pictures! So today let's scrape some image data. This time we'll pick the Xiaohuar site and grab photos of the campus belles.

Note that I split the spider and the image-download step into two separate scripts. We could merge them into one, but that would clearly hurt efficiency: downloading images is far slower than scraping text.

Problem:

Scrape image data with Python Scrapy.

Solution:

  • Use the Scrapy framework for the spider
  • Use MongoDB as the database
  • Scrape the photo URLs from the Xiaohuar site
  • Read the records back from MongoDB
  • Download the photos

Hands-on:

Step 1:

# Install Scrapy
python -m pip install scrapy
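
The code later in this post imports from xiaohuar.items, so it assumes a Scrapy project named xiaohuar already exists. If it does not, it can be generated first (the project name and layout here are my assumption; the original post does not show this step):

# Create the Scrapy project (items.py, pipelines.py and spiders/ end up under xiaohuar/xiaohuar/)
scrapy startproject xiaohuar
cd xiaohuar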

Step 2:

# Install MongoDB
wget http://172.16.1.150/mongo/mongodb-org-3.4.repo && \
yum makecache && \
yum -y install mongodb-org
systemctl start mongod.service
systemctl enable mongod.service
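
The item pipeline and the download script below both import pymongo to talk to MongoDB, so the driver has to be installed as well (this step is not shown in the original post):

# Install the MongoDB driver for Python
python -m pip install pymongo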

Step 3:

# Configure items
vi items.py
```
import scrapy

class XiaohuarItem(scrapy.Item):
    # Girl's name
    girls_name = scrapy.Field()
    # School name
    school_name = scrapy.Field()
    # Image download URL
    download_link = scrapy.Field()
```
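
Scrapy items behave like dictionaries with a fixed set of keys, which is what lets the pipeline call dict(item) later. A quick sketch of how one is filled in (the field values here are just placeholders):

```
from xiaohuar.items import XiaohuarItem

# Fields are assigned like dict keys; an unknown key would raise a KeyError
item = XiaohuarItem()
item['girls_name'] = 'some name'
item['school_name'] = 'some school'
item['download_link'] = '/d/file/example.jpg'
print(dict(item))
```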

# Configure pipelines
vi pipelines.py
```
from pymongo import MongoClient

class TutorialPipeline(object):
    def __init__(self):
        # Connect to MongoDB and write into the Img collection of the xiaohuar database
        conn = MongoClient('172.10.2.105', 27017)
        db = conn.xiaohuar
        self.post = db.Img

    def process_item(self, item, spider):
        # Store every scraped item as a MongoDB document
        imgInfo = dict(item)
        self.post.insert_one(imgInfo)
        return item
```
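
For Scrapy to actually route items through this pipeline, it also has to be registered in the project settings. A minimal sketch, assuming the default settings.py created by scrapy startproject xiaohuar (the original post does not show this file):

# Enable the pipeline
vi settings.py
```
# Route scraped items through TutorialPipeline (lower number = earlier in the pipeline order)
ITEM_PIPELINES = {
    'xiaohuar.pipelines.TutorialPipeline': 300,
}
```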


# Configure the spider
vi spiders/xiaohuar_spider.py
```
from scrapy.spiders import CrawlSpider
from xiaohuar.items import XiaohuarItem
import scrapy

class XiaohuarSpider(CrawlSpider):
    name = "xiaohuar"
    offset = 0
    url = "http://www.xiaohuar.com/list-1-"
    tail = ".html"
    start_urls = (
        url + str(offset) + tail,
    )

    def parse(self, response):
        images = response.xpath('//div[@class="img"]')
        print('*' * 80)
        print(len(images))
        for girls_photos in images:
            item = XiaohuarItem()
            # Girl's name
            item['girls_name'] = girls_photos.xpath('.//span[@class="price"]/text()').extract()[0]
            # School name
            item['school_name'] = girls_photos.xpath('.//div[@class="btns"]/a/text()').extract()[0]
            # Image URL
            item['download_link'] = girls_photos.xpath('.//a/img/@src').extract()[0]
            yield item

        # Keep paging through the listing until page 44
        if self.offset < 44:
            self.offset += 1
            yield scrapy.Request(self.url + str(self.offset) + self.tail, callback=self.parse)
```
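
Before running the full crawl, the XPath selectors above can be sanity-checked interactively with Scrapy's shell. A minimal sketch, assuming the listing page still has the structure it had when this post was written:

```
# Start an interactive session first:  scrapy shell "http://www.xiaohuar.com/list-1-0.html"
# Then evaluate the same expressions the spider uses:
len(response.xpath('//div[@class="img"]'))                        # photo cards found on the page
response.xpath('//div[@class="img"]//a/img/@src').extract()[:3]   # first few image links
```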

Step 4:

# Start the spider and scrape the image data
scrapy crawl xiaohuar

Step 5:

Now check the data in the database.
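
A quick way to check the results from Python with pymongo, assuming the same host and collection the pipeline writes to:

```
from pymongo import MongoClient

# Connect to the same MongoDB instance the pipeline uses
conn = MongoClient('172.10.2.105', 27017)
db = conn.xiaohuar

print(db.Img.count_documents({}))   # total number of scraped records (use .count() on very old pymongo)
print(db.Img.find_one())            # one sample document
```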


Step 6:

Now let's write the image-download script, download_img.py:

#!/usr/bin/env python3
#coding=utf-8
import os
import urllib.request
from pymongo import MongoClient

def save_img(girls_name, school_name, download_link, file_path):
    # Save the image into file_path (created if it does not exist yet)
    try:
        if not os.path.exists(file_path):
            print('Folder', file_path, 'does not exist, creating it')
            os.makedirs(file_path)
        check = download_link.split('/')
        # The stored link may be relative, protocol-less or absolute; normalise it before downloading
        if check[0] == "":
            file_suffix = download_link.split('.')[-1]
            file_url = "http://www.xiaohuar.com" + download_link
        elif "www" in check[0]:
            file_suffix = download_link.split('.')[-1]
            file_url = "http://" + download_link
        elif "http" in check[0]:
            file_suffix = download_link.split('.')[-1]
            file_url = download_link
        else:
            print('Unrecognised link format, skipping:', download_link)
            return
        file_name = file_path + girls_name + "-" + school_name + "." + file_suffix
        urllib.request.urlretrieve(file_url, file_name)
    except IOError as e:
        print('File operation failed', e)
    except Exception as e:
        print('Error:', e)

file_path = "/root/xiaohuar_images/"
# Read the records back from MongoDB
conn = MongoClient('172.10.2.105', 27017)
db = conn.xiaohuar

for record in db.Img.find():
    download_link = record['download_link']
    girls_name = record['girls_name']
    school_name = record['school_name']
    save_img(girls_name, school_name, download_link, file_path)

Step 7:

python download_img.py
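
A small sanity check after the script finishes, assuming the same output folder it writes to:

```
import os

# Count how many photos ended up in the download folder
print(len(os.listdir('/root/xiaohuar_images/')))
```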

Step 8:

Finally, I built a display page so the end result can be browsed comfortably.
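
The display page itself is not included in the post, so here is only a rough sketch of one way to build something similar (entirely an assumption, not the author's actual interface): generate a simple index.html over the downloaded photos and serve the folder with Python's built-in HTTP server.

```
import os

image_dir = "/root/xiaohuar_images"   # the folder download_img.py writes to
files = sorted(f for f in os.listdir(image_dir)
               if f.lower().endswith(('.jpg', '.png', '.gif')))

# One heading plus one <img> tag per downloaded photo
rows = ['<h3>{0}</h3><img src="{0}" width="300">'.format(name) for name in files]
with open(os.path.join(image_dir, 'index.html'), 'w') as fh:
    fh.write('<html><body>' + '\n'.join(rows) + '</body></html>')

# Serve it afterwards, e.g.:  cd /root/xiaohuar_images && python -m http.server 8000
```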
