grep regex - dfs[a-z.]+

Official example

https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/SingleCluster.html

<code>bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar grep input output 'dfs[a-z.]+'
</code>

Problem statement:

  1. Implement regex matching.
  2. Count how many times each string matched by the regex occurs.

We will mainly use PHP to implement a similar set of rules. Without further ado, see below; try to keep your software versions consistent with the ones used in this experiment.
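Before bringing Hadoop into the picture, the whole task can be approximated locally with standard shell tools. A rough sketch, assuming the same Hadoop config directory that is used later as the data source:

<code># local approximation: extract every regex match, then count each distinct match
grep -ohE 'dfs[a-z.]+' /usr/local/hadoop-2.10.0/etc/hadoop/*.xml | sort | uniq -c
</code>

The MapReduce version below does essentially the same thing: map.php plays the role of grep -o, and reducer.php roughly the role of sort | uniq -c.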

Versions used in this experiment

Hadoop

<code>Hadoop 2.10.0
Subversion ssh://git.corp.linkedin.com:29418/hadoop/hadoop.git -r e2f1f118e465e787d8567dfa6e2f3b72a0eb9194
Compiled by jhung on 2019-10-22T19:10Z
Compiled with protoc 2.5.0
From source with checksum 7b2d8877c5ce8c9a2cca5c7e81aa4026
This command was run using /usr/local/hadoop-2.10.0/share/hadoop/common/hadoop-common-2.10.0.jar
</code>

PHP

<code>PHP 7.4.3 (cli) (built: Feb 23 2020 07:24:28) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
with Zend OPcache v7.4.3, Copyright (c), by Zend Technologies
</code>


Prepare the data source

Use /usr/local/hadoop-2.10.0/etc/hadoop as the data source and upload it to HDFS.

<code># create the input directory in HDFS if it does not exist yet
hdfs dfs -mkdir input

hdfs dfs -put /usr/local/hadoop-2.10.0/etc/hadoop/*.xml input
</code>
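Optionally, verify that the files actually landed in HDFS before running anything:

<code># list the uploaded config files
hdfs dfs -ls input
</code>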

Run the official example

<code>hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar grep input output 'dfs[a-z.]+'
</code>

View the results

<code>hdfs dfs -cat /user/root/output/*

1 dfsadmin
1 dfs.replication
1 dfs.permissions
1 dfs.namenode.secondary.http
1 dfs.http.address
</code>

Develop the MR code in PHP

Create the map.php script file:

<code>#!/usr/bin/php
<?php
// mapper: read lines from STDIN and emit the first regex match on each line
$pattern = '/dfs[a-z.]+/';
while ($line = fgets(STDIN)) {
    if (preg_match($pattern, $line, $matches)) {
        echo $matches[0] . PHP_EOL;
    }
}
</code>
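A quick smoke test of the mapper on its own, assuming the script has already been made executable (see the notes further down):

<code>echo '<name>dfs.replication</name>' | ./map.php
# should print: dfs.replication
</code>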

Create the reducer.php script file:

<code>#!/usr/bin/php
<?php
$result = [];
while ($line = fgets(STDIN)) {
    // split on the newline because every line emitted by map.php ends with one
    $arr = explode(PHP_EOL, $line);

    // count how many times each word matched in map.php appears
    $key = $arr[0];
    if (!isset($result[$key])) {
        $result[$key] = 0;
    }

    $result[$key]++;
}

// output "count word" pairs
foreach ($result as $key => $value) {
    echo "$value $key" . PHP_EOL;
}
</code>
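And a matching smoke test for the reducer, feeding it a few pre-matched words by hand:

<code>printf 'dfsadmin\ndfsadmin\ndfs.replication\n' | ./reducer.php
# should print:
# 2 dfsadmin
# 1 dfs.replication
</code>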

Notes:

  1. Make sure both scripts have execute permission (chmod); see the commands after this list.
  2. Watch out for Windows CRLF line endings.
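For example (a minimal sketch; sed is one way to strip carriage returns, and dos2unix also works if it is installed):

<code># make both scripts executable
chmod +x map.php reducer.php

# strip Windows carriage returns in case the files were edited on Windows
sed -i 's/\r$//' map.php reducer.php
</code>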

Once the PHP MR code is written, don't rush to throw it at hadoop-streaming; test it locally first.

<code>cat /usr/local/hadoop-2.10.0/etc/hadoop/*.xml | ./map.php | ./reducer.php
</code>

If the output matches expectations, run it through hadoop-streaming:

<code>hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.10.0.jar -mapper /mnt/d/workspaces/test/test1/map.php -reducer /mnt/d/workspaces/test/test1/reducer.php -input input -output output_5
</code>
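Note that a MapReduce job fails if its output directory already exists in HDFS, so when re-running, either pick a new output name or remove the old directory first:

<code># remove a previous output directory before re-running the job
hdfs dfs -rm -r output_5
</code>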

View the results

<code>hdfs dfs -cat /user/root/output_5/*

1 dfs.http.address
1 dfs.namenode.secondary.http
1 dfs.permissions
1 dfs.replication
1 dfsadmin
</code>

