grep regex - dfs[a-z.]+

Official example

https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/SingleCluster.html

<code>bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar grep input output 'dfs[a-z.]+'
</code>

Problem statement:

  1. Implement regex matching
  2. Count how many times each string matched by the regex appears (see the sketch below)
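
Before bringing Hadoop into the picture, the same two tasks can be illustrated with plain Unix tools against the local config files (a rough sketch only, not part of the official example):

<code># Rough local illustration of the two tasks:
# 1) match the regex, 2) count how many times each match appears
grep -Eoh 'dfs[a-z.]+' /usr/local/hadoop-2.10.0/etc/hadoop/*.xml | sort | uniq -c
</code>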

We will mainly use PHP to implement a similar set of rules. Without further ado, see below; try to keep your versions consistent with the ones used in this experiment.

Experiment versions

Hadoop

<code>Hadoop 2.10.0
Subversion ssh://git.corp.linkedin.com:29418/hadoop/hadoop.git -r e2f1f118e465e787d8567dfa6e2f3b72a0eb9194
Compiled by jhung on 2019-10-22T19:10Z
Compiled with protoc 2.5.0
From source with checksum 7b2d8877c5ce8c9a2cca5c7e81aa4026
This command was run using /usr/local/hadoop-2.10.0/share/hadoop/common/hadoop-common-2.10.0.jar
</code>

PHP

<code>PHP 7.4.3 (cli) (built: Feb 23 2020 07:24:28) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
with Zend OPcache v7.4.3, Copyright (c), by Zend Technologies
</code>


Prepare the data source

Use the configuration files under /usr/local/hadoop-2.10.0/etc/hadoop as the data source and upload them to HDFS.

<code># If the input directory does not exist in HDFS yet, create it first
hdfs dfs -mkdir input

hdfs dfs -put /usr/local/hadoop-2.10.0/etc/hadoop/*.xml input
</code>
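
To confirm the upload, list the input directory (by default it resolves under the current user's HDFS home, e.g. /user/root/input here):

<code># Verify that the XML files landed in HDFS
hdfs dfs -ls input
</code>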

Run the official example

<code>hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar grep input output 'dfs[a-z.]+'
</code>

Check the result

<code>hdfs dfs -cat /user/root/output/*

1 dfsadmin
1 dfs.replication
1 dfs.permissions
1 dfs.namenode.secondary.http
1 dfs.http.address
</code>

Develop the MR code in PHP

Create the map.php script file:

<code>#!/usr/bin/php
<?php
// Mapper: read lines from STDIN and emit the first regex match on each line
$pattern = '/dfs[a-z.]+/';
while ($line = fgets(STDIN)) {
    preg_match($pattern, $line, $matches);
    if ($matches) {
        echo $matches[0] . PHP_EOL;
    }
}
</code>

Create the reducer.php script file:

<code>#!/usr/bin/php
<?php
// Reducer: count how many times each word emitted by map.php appears
$result = [];
while ($line = fgets(STDIN)) {
    // Split here because every line output by map.php carries a trailing newline
    $arr = explode(PHP_EOL, $line);

    // Count occurrences of each word matched by the regex in map.php
    $key = $arr[0];
    if (!isset($result[$key])) {
        $result[$key] = 0;
    }

    $result[$key]++;
}

// Print the counts
foreach ($result as $key => $value) {
    echo "$value $key" . PHP_EOL;
}
</code>

Notes:

  1. Make sure the scripts have execute permission: chmod (see the snippet below)
  2. Watch out for Windows carriage-return/line-feed (CRLF) line endings
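
A quick sketch of both fixes, assuming the scripts sit in the current directory (sed is used here; dos2unix would work just as well):

<code># Grant execute permission to the streaming scripts
chmod +x map.php reducer.php

# Strip Windows CRLF line endings if the files were edited on Windows
sed -i 's/\r$//' map.php reducer.php
</code>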

Once the PHP MR code is written, don't rush to hand it to hadoop-streaming; test it locally first.

<code>cat /usr/local/hadoop-2.10.0/etc/hadoop/*.xml | ./map.php | ./reducer.php
</code>

If the output matches expectations, run it with hadoop-streaming:

<code>hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.10.0.jar \
    -mapper /mnt/d/workspaces/test/test1/map.php \
    -reducer /mnt/d/workspaces/test/test1/reducer.php \
    -input input \
    -output output_5
</code>
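
Note that -mapper and -reducer point at local absolute paths, which only works when the job runs where those paths exist (e.g. a single-node setup). On a multi-node cluster the scripts usually need to be shipped to the task nodes with the streaming -file option; a rough sketch (output_6 is just a fresh, not-yet-existing output directory name):

<code>hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.10.0.jar \
    -file /mnt/d/workspaces/test/test1/map.php \
    -file /mnt/d/workspaces/test/test1/reducer.php \
    -mapper map.php \
    -reducer reducer.php \
    -input input \
    -output output_6
</code>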

Check the result

<code>hdfs dfs -cat /user/root/output_5/*

1 dfs.http.address
1 dfs.namenode.secondary.http
1 dfs.permissions
1 dfs.replication
1 dfsadmin
</code>

