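Building a Distributed Crawler with Nutch, Hadoop, and MongoDB

A journey of a thousand miles begins with a single step; without accumulating small steps, no one reaches a thousand miles.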


1. Goal

Use Nutch, Hadoop, and MongoDB to build a simple distributed crawler: Nutch runs on Hadoop to crawl web pages and stores them in MongoDB.

2. Environment

CentOS7 Linux x86_64

JDK 1.8.0_161

mongodb 2.6.12-6

hadoop 2.9.1

apache-ant-1.9.4

apache-nutch-2.3.1

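3. Install Oracle JDK

See "Nutch-MongoDB-ElasticSearch搭建搜索引擎" (Building a Search Engine with Nutch, MongoDB, and ElasticSearch): https://www.toutiao.com/i6539542640034054663/

4. Install and Configure MongoDB

See "Nutch-MongoDB-ElasticSearch搭建搜索引擎" (Building a Search Engine with Nutch, MongoDB, and ElasticSearch): https://www.toutiao.com/i6539542640034054663/

5. Install and Configure Hadoop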


5.1 Download Hadoop 2.9.1: http://hadoop.apache.org/releases.html

5.2 Set the environment variables

/etc/profile:

export HADOOP_HOME=/home/vminger/workspace/sysapp/hadoop/hadoop-2.9.1

export PATH=$PATH:$HADOOP_HOME/bin

export PDSH_RCMD_TYPE=ssh

source /etc/profile

eval "$(ssh-agent -s)"

ssh-add

5.3 Configure HDFS

etc/hadoop/core-site.xml:


etc/hadoop/hdfs-site.xml:

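The contents of these two files are not reproduced in this text. As a minimal sketch for a single-node (pseudo-distributed) setup, the script below writes a typical pair. The NameNode address hdfs://localhost:9000, the replication factor of 1, and the CONF_DIR variable are assumptions based on the usual Hadoop single-node configuration, not values from the original article; adjust them for your own host.

```shell
# Sketch only: the values below are assumptions for a single-node setup.

# Hypothetical variable: directory that holds the Hadoop config files.
CONF_DIR="${CONF_DIR:-etc/hadoop}"
mkdir -p "$CONF_DIR"

# core-site.xml: tells clients where the default file system (NameNode) is.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: with only one DataNode, keep a single replica of each block.
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
```

Run it from the Hadoop installation directory so that etc/hadoop resolves to the real configuration directory, or set CONF_DIR explicitly.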

5.4 Format HDFS

bin/hdfs namenode -format

5.5 Start HDFS

sbin/start-dfs.sh

5.6 Configure YARN

etc/hadoop/mapred-site.xml:


etc/hadoop/yarn-site.xml:

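The contents of these two files are likewise not reproduced here. A minimal sketch, assuming the usual single-node YARN settings: MapReduce jobs are submitted to YARN, and the NodeManager runs the MapReduce shuffle service. The CONF_DIR variable is hypothetical. Hadoop 2.x typically ships only mapred-site.xml.template, so mapred-site.xml may need to be created.

```shell
# Sketch only: typical single-node YARN settings, assumed rather than
# taken from the original article.

# Hypothetical variable: directory that holds the Hadoop config files.
CONF_DIR="${CONF_DIR:-etc/hadoop}"
mkdir -p "$CONF_DIR"

# mapred-site.xml: run MapReduce jobs on YARN instead of the local runner.
cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

# yarn-site.xml: enable the shuffle service that MapReduce reducers need.
cat > "$CONF_DIR/yarn-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
```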

5.7 Start YARN

sbin/start-yarn.sh

5.8 Check the HDFS and YARN processes with jps

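The jps listing itself is not reproduced in this text. As a rough sketch, the helper below reports which daemons are missing from jps output. The list of expected names assumes a single-node setup in which HDFS and YARN all run on one machine, and missing_daemons is a hypothetical helper name.

```shell
# Daemons expected on a single node running both HDFS and YARN (assumption).
EXPECTED="NameNode DataNode SecondaryNameNode ResourceManager NodeManager"

# missing_daemons reads jps-style output ("<pid> <name>" per line) on stdin
# and prints every expected daemon that does not appear in it.
missing_daemons() {
  local listing name
  listing="$(awk '{print $2}')"
  for name in $EXPECTED; do
    # -x matches whole lines, so NameNode does not match SecondaryNameNode.
    if ! printf '%s\n' "$listing" | grep -qx "$name"; then
      echo "$name"
    fi
  done
}
```

On a live node this would be used as `jps | missing_daemons`; empty output means all five daemons are running.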

6. Install and Configure Nutch


6.1 Download apache-ant-1.9.4-bin.tar.gz and extract it. Download URL:

https://archive.apache.org/dist/ant/binaries/apache-ant-1.9.4-bin.tar.gz

6.2 Set the Ant environment variables

/etc/profile:

export ANT_HOME=/home/vminger/workspace/sysapp/ant/apache-ant-1.9.4

export PATH=$PATH:$ANT_HOME/bin

source /etc/profile

6.3 Download apache-nutch-2.3.1-src.tar.gz and extract it. Download URL:

http://nutch.apache.org/downloads.html

6.4 Set the Nutch environment variables

/etc/profile:

export NUTCH_HOME=/home/vminger/workspace/sysapp/nutch/apache-nutch-2.3.1/runtime/local

export PATH=$PATH:$NUTCH_HOME/bin

source /etc/profile

6.5 Configure Nutch

conf/nutch-site.xml:

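<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>Hist Crawler</value>
  </property>
</configuration>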

ivy/ivy.xml:

conf/gora.properties:

gora.datastore.default=org.apache.gora.mongodb.store.MongoStore

gora.mongodb.override_hadoop_configuration=false

gora.mongodb.mapping.file=/gora-mongodb-mapping.xml

gora.mongodb.servers=vminger:27017

gora.mongodb.db=test1

gora.mongodb.login=root

gora.mongodb.secret=root

6.6 Build Nutch: ant runtime

6.7 Set the URL filter rules for the crawl:

conf/regex-urlfilter.txt:

+^http://([a-z0-9]*\.)*sina\.com\.cn/
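To see what this rule accepts, the sketch below tests a few sample URLs against the pattern with grep -E. The leading "+" (Nutch's "accept" marker) is dropped, and the dots in the domain are escaped so they match only a literal ".". The sample URLs and the helper name url_allowed are illustrative assumptions. grep's extended regex syntax only approximates the Java regex engine Nutch uses, although the two should behave the same for a pattern this simple.

```shell
# Pattern from conf/regex-urlfilter.txt without the leading "+".
PATTERN='^http://([a-z0-9]*\.)*sina\.com\.cn/'

# url_allowed succeeds when the URL matches the pattern.
url_allowed() {
  printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

# Try a few sample URLs (illustrative only).
for url in \
  'http://www.sina.com.cn/' \
  'http://news.sina.com.cn/china/' \
  'https://www.sina.com.cn/' \
  'http://www.example.com/'; do
  if url_allowed "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

Note that the https URL is rejected, because the rule only covers http.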

6.8 Set the seed URLs:

Create the seed file: urls/seed.ini


Upload the seed file to HDFS: hdfs dfs -put urls urls
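The seed file's contents are not reproduced in this text. A minimal sketch of this step, assuming a single seed URL that matches the sina.com.cn filter rule (the URL itself is an assumption, not taken from the original):

```shell
# Assumed seed URL, chosen to match the sina.com.cn filter rule.
SEED_URL='http://www.sina.com.cn/'

# Create the seed directory and write one URL per line.
mkdir -p urls
printf '%s\n' "$SEED_URL" > urls/seed.ini

# Upload the directory to the user's HDFS home directory, but only
# when an HDFS client is available on this machine.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -put urls urls
else
  echo "hdfs not found; skipping upload" >&2
fi
```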

6.9 Change to the runtime/deploy directory and start the crawl, with crawl ID id345 and a depth of 3 (three rounds):

./bin/crawl urls id345 3


6.10 View the results in MongoDB

db.id345_webpage.count();


db.id345_webpage.find();


