Project address:
https://github.com/JunWangCode/picScrapy.git
This crawler can scrape every image category on http://www.jj20.com. It is for experimentation and reference only; please do not use it for anything else. It runs without issues on Ubuntu under Python 3.
Some settings
# Crawl depth
DEPTH_LIMIT = 5
# Where downloaded images are stored
IMAGES_STORE = '/home/jwang/Videos/Pic'
# Minimum image width
IMAGES_MIN_WIDTH = 500
# Minimum image height
IMAGES_MIN_HEIGHT = 500
A few more options worth noting:
# Download delay -- go easy, don't drag someone else's site down
DOWNLOAD_DELAY = 0.2
# Number of concurrent requests (the default is 16)
CONCURRENT_REQUESTS = 20
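The two pipelines shown in the core code below also have to be enabled in settings.py, or Scrapy will ignore them. A minimal sketch, assuming the default project layout with a picScrapy package (the module path is a guess, not taken from the repository):

# Lower numbers run earlier in the pipeline chain
ITEM_PIPELINES = {
    'picScrapy.pipelines.PicscrapyPipeline': 1,
    'picScrapy.pipelines.WebcrawlerScrapyPipeline': 2,
}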
Start the crawler
python3 -m scrapy crawl pic
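The spider itself is not reproduced in this post; only the items and pipelines are. For orientation, a spider feeding PicscrapyItem might be shaped roughly like the sketch below. The start URL, selectors, and class name are illustrative assumptions, not the repository's actual code:

import scrapy
from picScrapy.items import PicscrapyItem  # hypothetical import path


class PicSpider(scrapy.Spider):
    name = 'pic'  # matches "scrapy crawl pic" above
    start_urls = ['http://www.jj20.com/']

    def parse(self, response):
        # Placeholder selectors -- the real page structure must be
        # inspected in the browser before writing these
        for gallery in response.css('div.gallery'):
            item = PicscrapyItem()
            item['title'] = gallery.css('h2::text').get()
            item['category_name'] = gallery.css('span.cat::text').get()
            # urljoin handles relative src attributes
            item['image_urls'] = [response.urljoin(src)
                                  for src in gallery.css('img::attr(src)').getall()]
            yield item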
Core code:
items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class PicscrapyItem(scrapy.Item):
    # image_urls is read by ImagesPipeline; images receives download results
    image_urls = scrapy.Field()
    images = scrapy.Field()
    title = scrapy.Field()
    category_name = scrapy.Field()


# Goods data (used by a separate spider, not the image crawler)
class AfscrapyItem(scrapy.Item):
    goods_id = scrapy.Field()
    shop_name = scrapy.Field()
    category_name = scrapy.Field()
    title = scrapy.Field()
    sales_num = scrapy.Field()
    unit = scrapy.Field()
    price = scrapy.Field()
    location = scrapy.Field()
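A quick way to sanity-check the item definitions is to populate one by hand; scrapy.Item supports dict-style access, and assigning to an undeclared field raises a KeyError. The import path below is a guess:

from picScrapy.items import PicscrapyItem  # hypothetical import path

item = PicscrapyItem()
item['title'] = 'example title'
item['category_name'] = 'scenery'
item['image_urls'] = ['http://www.jj20.com/example.jpg']
print(dict(item))  # items convert cleanly to plain dicts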
pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from urllib.parse import urlparse
import time

import pymysql
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


def db_handler():
    # Opens a MySQL connection with autocommit enabled
    conn = pymysql.connect(
        host='192.168.0.111',
        user='root',
        passwd='',
        charset='utf8',
        db='scrapy_data',
        use_unicode=True
    )
    conn.autocommit(True)
    return conn


class PicscrapyPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Pass title and category along to file_path() via request meta
        return [Request(x, meta={'title': item['title'], 'cat': item['category_name']})
                for x in item.get(self.images_urls_field, [])]

    # Overridden to control how downloaded image file names are generated:
    # <category>/<title>/<name>.jpg instead of the default SHA1 hash
    def file_path(self, request, response=None, info=None):
        if not isinstance(request, Request):
            url = request
        else:
            url = request.url
        url = urlparse(url)
        # Assumes the image URL path always has this fixed depth
        img_name = url.path.split('/')[5].split('.')[0]
        return request.meta['cat'] + '/' + request.meta['title'] + '/%s.jpg' % img_name


class WebcrawlerScrapyPipeline(object):
    def __init__(self):
        self.db_object = db_handler()
        # Reuse the same connection; the original code opened a second one
        # here, which rollback() on self.db_object could never affect
        self.cursor = self.db_object.cursor()

    def process_item(self, item, spider):
        # "全部" ("All") is the aggregate category; skip the DB insert but
        # still return the item so later pipelines receive it
        if item['category_name'] == "全部":
            return item
        try:
            sql = "insert into " + spider.name + "(goods_id, shop_name, " \
                  "category_name, title, sales_num, unit, price, location, created_at) " \
                  "values (%s, %s, %s, %s, %s, %s, %s, %s, %s)"
            params = (
                item['goods_id'], item['shop_name'], item['category_name'], item['title'],
                item['sales_num'], item['unit'], item['price'], item['location'],
                time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
            )
            self.cursor.execute(sql, params)

        except pymysql.MySQLError as e:
            # pymysql raises MySQLError subclasses, not RuntimeError
            self.db_object.rollback()
            print(e)

        return item
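The insert in WebcrawlerScrapyPipeline assumes a table named after spider.name already exists in the scrapy_data database. The schema is not shown in the post; a one-off setup script might look like the sketch below, where the table name ('af') and the column types are guesses based on the item fields:

import pymysql

conn = pymysql.connect(host='192.168.0.111', user='root', passwd='',
                       charset='utf8', db='scrapy_data')
with conn.cursor() as cur:
    # Hypothetical schema -- adjust the table name to match spider.name
    cur.execute("""
        CREATE TABLE IF NOT EXISTS af (
            id INT AUTO_INCREMENT PRIMARY KEY,
            goods_id VARCHAR(64),
            shop_name VARCHAR(255),
            category_name VARCHAR(128),
            title VARCHAR(255),
            sales_num INT,
            unit VARCHAR(32),
            price DECIMAL(10, 2),
            location VARCHAR(128),
            created_at DATETIME
        ) DEFAULT CHARSET = utf8
    """)
conn.commit()
conn.close()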