Official example
https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/SingleCluster.html
<code>bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar grep input output 'dfs[a-z.]+'
/<code>
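Task:
- Implement regular-expression matching
- Count how many times each matched string occurs
The goal is to implement the same logic in PHP. To reproduce the steps below, keep your versions as close to the ones listed here as possible.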
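As a point of reference, the two steps of the task can be sketched with standard shell tools before any PHP or Hadoop is involved. This is only an illustrative sketch: `count_matches` is a hypothetical helper name, and it assumes a `grep` that supports `-o` (GNU and BSD grep both do).

```shell
#!/bin/sh
# count_matches: hypothetical helper, not part of Hadoop or PHP.
# Reads text on stdin and prints "count match" for every distinct match.
count_matches() {
    pattern=$1
    # grep -o prints each match on its own line (the "map" step),
    # sort groups identical matches together (the "shuffle" step),
    # uniq -c counts each group (the "reduce" step),
    # awk strips the padding that uniq -c puts before the count.
    grep -oE "$pattern" | sort | uniq -c | awk '{ print $1, $2 }'
}

# Example: count dfs.* names in two lines of sample text.
printf '%s\n' 'set dfs.replication and dfs.permissions' 'dfs.replication again' \
    | count_matches 'dfs[a-z.]+'
# prints:
# 1 dfs.permissions
# 2 dfs.replication
```

The PHP scripts in this article split the same work into a mapper (matching) and a reducer (counting).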
Versions used
Hadoop
<code>Hadoop 2.10.0
Subversion ssh://git.corp.linkedin.com:29418/hadoop/hadoop.git -r e2f1f118e465e787d8567dfa6e2f3b72a0eb9194
Compiled by jhung on 2019-10-22T19:10Z
Compiled with protoc 2.5.0
From source with checksum 7b2d8877c5ce8c9a2cca5c7e81aa4026
This command was run using /usr/local/hadoop-2.10.0/share/hadoop/common/hadoop-common-2.10.0.jar
/<code>
PHP
<code>PHP 7.4.3 (cli) (built: Feb 23 2020 07:24:28) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
with Zend OPcache v7.4.3, Copyright (c), by Zend Technologies
/<code>
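Preparing the data source
Use the XML files under /usr/local/hadoop-2.10.0/etc/hadoop as the data source and upload them to HDFS.
<code># Create the input directory in HDFS if it does not already exist
hdfs dfs -mkdir -p input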
hdfs dfs -put /usr/local/hadoop-2.10.0/etc/hadoop/*.xml input
/<code>
Run the official example
<code>hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar grep input output 'dfs[a-z.]+'
/<code>
View the results
<code>hdfs dfs -cat /user/root/output/*
1 dfsadmin
1 dfs.replication
1 dfs.permissions
1 dfs.namenode.secondary.http
1 dfs.http.address
/<code>
Writing the MapReduce code in PHP
Create the map.php script:
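<code>#!/usr/bin/php
<?php
// Same pattern as the official grep example.
$pattern = '/dfs[a-z.]+/';
while (($line = fgets(STDIN)) !== false) {
    // Emit every match on the line, not just the first,
    // so repeated matches are all counted by the reducer.
    if (preg_match_all($pattern, $line, $matches)) {
        foreach ($matches[0] as $match) {
            echo $match . PHP_EOL;
        }
    }
}
/<code>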
Create the reducer.php script:
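<code>#!/usr/bin/php
<?php
$result = [];
while (($line = fgets(STDIN)) !== false) {
    // map.php ends every output line with a newline; strip it, along with
    // any trailing tab or carriage return that may follow the key.
    $key = rtrim($line);
    if ($key === '') {
        continue;
    }
    // Count how many times each string matched by map.php occurs.
    if (!isset($result[$key])) {
        $result[$key] = 0;
    }
    $result[$key]++;
}
// Output one "count match" line per distinct match.
foreach ($result as $key => $value) {
    echo "$value $key" . PHP_EOL;
}
/<code>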
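Notes:
- Make both scripts executable with chmod (for example, chmod +x map.php reducer.php)
- Watch out for Windows line endings (CRLF): a carriage return at the end of the #! line stops the script from starting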
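Both notes can be handled in one step. The following is a minimal sketch, assuming the scripts were edited on Windows and may contain carriage returns; `prepare_script` is a hypothetical helper name, not a Hadoop or PHP tool.

```shell
#!/bin/sh
# prepare_script: hypothetical helper that makes a streaming script runnable.
prepare_script() {
    script=$1
    tmp=$(mktemp) || return 1
    # Drop every carriage return so "#!/usr/bin/php\r" becomes "#!/usr/bin/php".
    tr -d '\r' < "$script" > "$tmp" || return 1
    # Write back through cat so the original file keeps its path and owner.
    cat "$tmp" > "$script" || return 1
    rm -f "$tmp"
    # Mark the script executable so it can be launched directly.
    chmod +x "$script"
}

# Usage (assumes both scripts are in the current directory):
#   prepare_script map.php
#   prepare_script reducer.php
```

Running it on a file that already has Unix line endings leaves the content unchanged, so it should be safe to apply more than once.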
Once the PHP MapReduce code is written, do not rush to submit it to Hadoop Streaming; test it locally first.
<code>cat /usr/local/hadoop-2.10.0/etc/hadoop/*.xml | ./map.php | ./reducer.php
/<code>
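One difference between the local pipeline and a real job is that Hadoop Streaming sorts the mapper output by key before it reaches the reducer, while a plain pipe does not. The sketch below puts a `sort` between the two stages to get closer to that behaviour. `simulate_streaming` is a hypothetical helper name, and the sketch only approximates a single-reducer job.

```shell
#!/bin/sh
# simulate_streaming: hypothetical helper.
#   $1 = mapper command, $2 = reducer command; input is read from stdin.
simulate_streaming() {
    mapper=$1
    reducer=$2
    # LC_ALL=C sorts by raw bytes, so the order does not depend on the locale.
    sh -c "$mapper" | LC_ALL=C sort | sh -c "$reducer"
}

# Usage (assumes map.php and reducer.php are executable in the current directory):
#   cat /usr/local/hadoop-2.10.0/etc/hadoop/*.xml | simulate_streaming ./map.php ./reducer.php
```

Because reducer.php aggregates in an array, it should give the same counts with or without the sort; the sorted run mainly shows the key order the reducer can expect inside Hadoop.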
If the output is as expected, run the job with Hadoop Streaming:
<code>hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.10.0.jar -mapper /mnt/d/workspaces/test/test1/map.php -reducer /mnt/d/workspaces/test/test1/reducer.php -input input -output output_5
/<code>
View the results
<code>hdfs dfs -cat /user/root/output_5/*
1 dfs.http.address
1 dfs.namenode.secondary.http
1 dfs.permissions
1 dfs.replication
1 dfsadmin
/<code>
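The PHP job and the official example list the same five matches, only in a different order. To check that automatically, the two listings can be compared after sorting. This is a sketch under assumptions: `same_counts` is a hypothetical helper, and it expects each result to have been saved to a local file first (for example by redirecting `hdfs dfs -cat`).

```shell
#!/bin/sh
# same_counts: hypothetical helper. Succeeds when two result files hold the
# same "count match" lines, regardless of line order or tab/space separators.
same_counts() {
    a=$(mktemp) || return 2
    b=$(mktemp) || return 2
    # Normalise each non-empty line to "count match", then sort by bytes.
    awk 'NF { print $1, $2 }' "$1" | LC_ALL=C sort > "$a"
    awk 'NF { print $1, $2 }' "$2" | LC_ALL=C sort > "$b"
    cmp -s "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return $status
}

# Usage (official.txt and php.txt are assumed file names):
#   hdfs dfs -cat /user/root/output/*   > official.txt
#   hdfs dfs -cat /user/root/output_5/* > php.txt
#   same_counts official.txt php.txt && echo "results match"
```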