Compiling and installing Spark 1.6.1 + CarbonData 1.1.0

Apache CarbonData is an open-source project developed by Huawei and contributed to the Apache Foundation. It is a big-data file storage format built on a combination of columnar storage, indexing, compression, and encoding techniques, and it can significantly speed up OLAP query analysis over PB-scale data. See the official documentation for details.


1. Build

Download the CarbonData 1.1.0 source from GitHub and start the build:

mvn clean package -DskipTests -Pwindows -Pspark-1.6 -Dspark.version=1.6.1 -Dhadoop.version=2.6.0

The build fails with:

[ERROR] Failed to execute goal org.scala-tools:maven-scala-plugin:2.15.2:compile (default) on project carbondata-spark-common: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 1(Exit value: 1) -> [Help 1]

[ERROR]

Cause: the Spark version and the CarbonData version do not match. Either change the Spark version or switch to a matching CarbonData version.
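One way to narrow this down before rebuilding is to check which Spark profiles the source tree actually declares. A hedged sketch, run from the CarbonData source root (the grep pattern and profile name are assumptions; confirm them against the pom's `<profiles>` section):

```shell
# Sketch only: list the Spark profiles declared in the CarbonData pom,
# then rebuild with a profile/version pair that the pom actually supports.
grep -n '<id>spark-' pom.xml
mvn clean package -DskipTests -Pspark-1.6 -Dspark.version=1.6.1 -Dhadoop.version=2.6.0
```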

2. Install

Following the official documentation:

1. Build the CarbonData project and get the assembly jar from ./assembly/target/scala-2.1x/carbondata_xxx.jar, then copy it to the $SPARK_HOME/carbonlib folder.

NOTE: Create the carbonlib folder if it does not exist inside the $SPARK_HOME path.

2. Copy the ./conf/carbon.properties.template file from the CarbonData repository to the $SPARK_HOME/conf/ folder and rename it to carbon.properties.

3. Create a tar.gz file of the carbonlib folder and move it inside the carbonlib folder.

cd $SPARK_HOME

tar -zcvf carbondata.tar.gz carbonlib/

mv carbondata.tar.gz carbonlib/

4. Update the configuration

4.1. The Spark-side settings are not placed in a config file here; they are appended directly to the spark-shell command line instead.

4.2. Add the following properties to $SPARK_HOME/conf/carbon.properties:

#Mandatory. Carbon Store path

carbon.storelocation=hdfs://nameservice1/carbondata/store

#Base directory for Data files

carbon.ddl.base.hdfs.url=hdfs://nameservice1/carbondata/data

#Path where the bad records are stored

carbon.badRecords.location=hdfs://nameservice1/carbondata/data-bad
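Since all three properties point at HDFS paths, it may help to pre-create those directories. A sketch only, which assumes an hdfs client configured for nameservice1 (it cannot run without the cluster):

```shell
# Sketch only: pre-create the HDFS directories referenced in carbon.properties.
# The paths are taken from the config above; requires a configured hdfs client.
hdfs dfs -mkdir -p hdfs://nameservice1/carbondata/store
hdfs dfs -mkdir -p hdfs://nameservice1/carbondata/data
hdfs dfs -mkdir -p hdfs://nameservice1/carbondata/data-bad
```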

3. Test cases

3.1. Use spark-shell to verify that the installation succeeded.

# launch spark-shell

spark-shell --master yarn-client \
--queue default \
--driver-memory 4g \
--num-executors 10 \
--executor-memory 12g \
--executor-cores 2 \
--conf spark.executor.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=128m -XX:+UseParallelOldGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dcarbon.properties.filepath=/usr/local/spark/conf/carbon.properties" \
--conf spark.driver.extraJavaOptions="-Dcarbon.properties.filepath=/usr/local/spark/conf/carbon.properties" \
--conf spark.driver.extraClassPath="/usr/local/spark/carbonlib/*" \
--conf spark.executor.extraClassPath="/usr/local/spark/carbonlib/*" \
--conf spark.yarn.dist.files="/usr/local/spark/conf/carbon.properties" \
--conf spark.yarn.dist.archives="/usr/local/spark/carbonlib/carbondata.tar.gz" \
--jars /usr/local/spark/carbonlib/carbondata_2.10-1.1.0-shade-hadoop2.6.0.jar,$HIVE_HOME/lib/mysql-connector-java-5.1.36.jar,$HIVE_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$HIVE_HOME/lib/datanucleus-core-3.2.10.jar,$HIVE_HOME/lib/datanucleus-rdbms-3.2.9.jar

Note: --conf takes a space before the property name, and a given property such as spark.executor.extraJavaOptions may only be passed once (a second occurrence overwrites the first), so the -Dcarbon.properties.filepath flag is merged into the GC options above.

import org.apache.spark.sql.CarbonContext 

Two ways to create the CarbonContext (cc):

(1). # the first path is the store location on HDFS; the second is a local directory (metadata is kept in local files)

scala> val cc = new CarbonContext(sc, "hdfs://nameservice1/carbondata_test", "/home/hadoop/carbondata_meta2")

(2). # use the default configuration and integrate with Hive, storing the metadata in MySQL (Spark must first be integrated with Hive)

scala> val cc = new CarbonContext(sc, "hdfs://nameservice1/carbondata_test")

cc.sql("""CREATE TABLE IF NOT EXISTS test_table(
  id string,
  name string,
  city string,
  age Int)
  STORED BY 'carbondata'""")

cc.sql("LOAD DATA INPATH 'hdfs://nameservice1/kyrie/sample.csv' INTO TABLE test_table")

Contents of sample.csv:

id,name,city,age

1,xiaojiang,beijing,18

2,dayue,beijing,20

3,xx,shanghai,22
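The file can be created locally and staged with a short shell snippet; the HDFS upload line is commented out because it needs the running cluster (the hdfs:// path comes from the LOAD DATA example above):

```shell
# Create the sample.csv shown above, then push it to HDFS for LOAD DATA.
cat > sample.csv <<'EOF'
id,name,city,age
1,xiaojiang,beijing,18
2,dayue,beijing,20
3,xx,shanghai,22
EOF
# hdfs dfs -put -f sample.csv hdfs://nameservice1/kyrie/sample.csv
```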

scala> cc.sql("select * from test_table").show(10)

+---+---------+--------+---+
| id|     name|    city|age|
+---+---------+--------+---+
|  1|xiaojiang| beijing| 18|
|  2|    dayue| beijing| 20|
|  3|       xx|shanghai| 22|
+---+---------+--------+---+

3.2. Moving data in and out of CarbonData

CarbonData supports two ways of loading data:

- loading a CSV file directly into a CarbonData table (shown in 3.1)

- loading through the Spark SQL DataFrame API

# read a Parquet file through cc

scala> val un = cc.read.parquet("/kyrie/unliver/gray/4/2/standard")

# write it out to a CarbonData table

scala> un.write.format("carbondata").option("tableName", "asset_unilever").option("compress", "true").option("tempCSV", "false").save()

Read the CarbonData table back:

// use the datasource API to read
val in = cc.read.format("carbondata").option("tableName", "carbon1").load()

