Apache CarbonData is an open-source project developed by Huawei and contributed to the Apache Software Foundation. It is a big-data file storage format designed around columnar storage, indexing, compression, and encoding techniques, and it can raise OLAP query performance on PB-scale data to a new level. See the official documentation for details.
1. Build
Download the CarbonData 1.1.0 source from GitHub, then build:
mvn clean package -DskipTests -Pwindows -Pspark-1.6 -Dspark.version=1.6.1 -Dhadoop.version=2.6.0
The build fails with:
[ERROR] Failed to execute goal org.scala-tools:maven-scala-plugin:2.15.2:compile (default) on project carbondata-spark-common: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 1(Exit value: 1) -> [Help 1]
[ERROR]
Cause: the Spark version does not match this CarbonData release. Fix it by changing the Spark version, or by switching to a CarbonData version that supports your Spark.
2. Install
Following the official docs:
1. Build the CarbonData project, get the assembly jar from ./assembly/target/
scala-2.1x/carbondata_xxx.jar, and copy it to the $SPARK_HOME/carbonlib folder.
NOTE: Create the carbonlib folder if it does not exist inside the $SPARK_HOME path.
2. Copy the ./conf/carbon.properties.template file from CarbonData repository to
$SPARK_HOME/conf/ folder and rename the file to carbon.properties.
3. Create tar.gz file of carbonlib folder and move it inside the carbonlib folder.
cd $SPARK_HOME
tar -zcvf carbondata.tar.gz carbonlib/
mv carbondata.tar.gz carbonlib/
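Step 3 can be sanity-checked by listing the archive afterwards. The sketch below reproduces the packaging in a throwaway directory; all paths are placeholders (in a real install you would run this from $SPARK_HOME with the actual assembly jar already in carbonlib/):

```shell
# Sketch: reproduce the carbonlib packaging step in a temp directory.
# The jar here is an empty stand-in for the real assembly jar.
workdir=$(mktemp -d)
cd "$workdir"
mkdir carbonlib
touch carbonlib/carbondata_2.10-1.1.0-shade-hadoop2.6.0.jar
tar -zcf carbondata.tar.gz carbonlib/
mv carbondata.tar.gz carbonlib/
# listing the archive confirms the jar was packaged
tar -tzf carbonlib/carbondata.tar.gz
```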
4. Edit the configuration
4.1. The Spark-side settings are not written into a config file here; they are passed directly on the spark-shell command line (see below).
4.2. Add the following properties to $SPARK_HOME/conf/carbon.properties:
#Mandatory. Carbon Store path
carbon.storelocation=hdfs://nameservice1/carbondata/store
#Base directory for Data files
carbon.ddl.base.hdfs.url=hdfs://nameservice1/carbondata/data
#Path where the bad records are stored
carbon.badRecords.location=hdfs://nameservice1/carbondata/data-bad
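A quick grep confirms the mandatory key made it into the file. The sketch below writes the three properties above to a temp file and checks them; the real file lives at $SPARK_HOME/conf/carbon.properties:

```shell
# Sketch: verify the mandatory carbon.storelocation key is present
# in a carbon.properties file (temp copy used here).
props=$(mktemp)
cat > "$props" <<'EOF'
carbon.storelocation=hdfs://nameservice1/carbondata/store
carbon.ddl.base.hdfs.url=hdfs://nameservice1/carbondata/data
carbon.badRecords.location=hdfs://nameservice1/carbondata/data-bad
EOF
grep -q '^carbon.storelocation=' "$props" && echo "store location configured"
```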
3. Test case:
3.1. Use spark-shell to verify that the installation works.
#run the shell command
spark-shell --master yarn-client \
--queue default \
--driver-memory 4g \
--num-executors 10 \
--executor-memory 12g \
--executor-cores 2 \
--conf spark.executor.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=128m -XX:+UseParallelOldGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dcarbon.properties.filepath=/usr/local/spark/conf/carbon.properties" \
--conf spark.driver.extraJavaOptions="-Dcarbon.properties.filepath=/usr/local/spark/conf/carbon.properties" \
--conf spark.driver.extraClassPath="/usr/local/spark/carbonlib/*" \
--conf spark.executor.extraClassPath="/usr/local/spark/carbonlib/*" \
--conf spark.yarn.dist.files="/usr/local/spark/conf/carbon.properties" \
--conf spark.yarn.dist.archives="/usr/local/spark/carbonlib/carbondata.tar.gz" \
--jars /usr/local/spark/carbonlib/carbondata_2.10-1.1.0-shade-hadoop2.6.0.jar,$HIVE_HOME/lib/mysql-connector-java-5.1.36.jar,$HIVE_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$HIVE_HOME/lib/datanucleus-core-3.2.10.jar,$HIVE_HOME/lib/datanucleus-rdbms-3.2.9.jar
import org.apache.spark.sql.CarbonContext
Two ways to create cc:
(1).#the first path is on HDFS (the store), the second is a local directory (metadata is kept in local files)
scala> val cc = new CarbonContext(sc, "hdfs://nameservice1/carbondata_test", "/home/hadoop/carbondata_meta2")
(2).#use the default configuration and integrate with Hive so metadata is stored in MySQL (Spark must already be integrated with Hive)
scala> val cc = new CarbonContext(sc, "hdfs://nameservice1/carbondata_test")
cc.sql("""CREATE TABLE IF NOT EXISTS test_table(
id string,
name string,
city string,
age Int)
STORED BY 'carbondata'""")
cc.sql("LOAD DATA INPATH 'hdfs://nameservice1/kyrie/sample.csv' INTO TABLE test_table")
Contents of sample.csv:
id,name,city,age
1,xiaojiang,beijing,18
2,dayue,beijing,20
3,xx,shanghai,22
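Before the LOAD DATA statement can run, this CSV has to exist on HDFS. The sketch below only creates the local file and checks its shape; the upload (e.g. `hdfs dfs -put sample.csv hdfs://nameservice1/kyrie/`) needs a live cluster and is not run here:

```shell
# Sketch: create sample.csv locally; copy it to HDFS separately, e.g.
#   hdfs dfs -put sample.csv hdfs://nameservice1/kyrie/
csv=$(mktemp -d)/sample.csv
cat > "$csv" <<'EOF'
id,name,city,age
1,xiaojiang,beijing,18
2,dayue,beijing,20
3,xx,shanghai,22
EOF
# header plus 3 data rows = 4 lines
wc -l < "$csv"
```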
scala> cc.sql("select * from test_table").show(10)
+---+---------+--------+---+
| id| name| city|age|
+---+---------+--------+---+
| 1|xiaojiang| beijing| 18|
| 2| dayue| beijing| 20|
| 3| xx|shanghai| 22|
+---+---------+--------+---+
3.2. Moving data in and out of CarbonData
CarbonData supports two ways of importing data:
- loading a CSV file directly into a CarbonData table (shown in 3.1)
- importing through the Spark SQL API
#read a parquet file with cc
scala> val un = cc.read.parquet("/kyrie/unliver/gray/4/2/standard")
#write it to carbondata
scala> un.write.format("carbondata").option("tableName", "asset_unilever").option("compress", "true").option("tempCSV", "false").save()
Read CarbonData data:
// use datasource api to read
val in = cc.read.format("carbondata").option("tableName", "carbon1").load()