First, open the Scala environment by launching Spark's interactive shell (spark-shell) from the Hadoop cluster.
Launch command: spark-shell [optional flag: --executor-memory 500m]
Method 1: start spark-shell directly on the local machine:
[root@yhfmaster ~]# spark-shell
Method 2: connect to a Spark master running on another machine (a standalone master URL takes the form spark://host:port, with 7077 as the default port):
[root@yhfmaster ~]# spark-shell --master spark://192.168.1.100:7077 --executor-memory 500m
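Once the shell is up, you can confirm which master it actually connected to; sc is the SparkContext that spark-shell creates automatically:

scala> sc.master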
1. Import the HiveContext class
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
2. Initialize a HiveContext from the built-in SparkContext (sc)
scala> val hiveContext = new HiveContext(sc)
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@b6db92b
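Note that on Spark 2.x and later, HiveContext is deprecated in favor of SparkSession with Hive support enabled. A minimal sketch of the newer entry point (the application name is only a placeholder; inside spark-shell a session named spark already exists):

import org.apache.spark.sql.SparkSession

// Build a SparkSession with Hive support; equivalent to the old HiveContext.
val spark = SparkSession.builder()
  .appName("HiveQueryExample") // placeholder name, not from the original
  .enableHiveSupport()
  .getOrCreate()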
3. Switch to the target Hive database (yh_test in this example)
scala> hiveContext.sql("use yh_test")
res0: org.apache.spark.sql.DataFrame = [result: string]
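To confirm the switch, you can list the tables in the current database; show() prints the result in tabular form (an optional sanity check):

scala> hiveContext.sql("show tables").show()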
4. Query the logistic_lineback_regression table, convert the resulting DataFrame to an RDD, and assign it to the hitetst variable
scala> val hitetst = hiveContext.sql("select * from logistic_lineback_regression limit 10").rdd
hitetst: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[7] at rdd at <console>:34
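Since the query used limit 10, the RDD should hold exactly 10 rows; counting it is a quick optional check before pulling data:

scala> hitetst.count()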
5. Take 5 rows from hitetst and print the first two columns
scala> hitetst.take(5).foreach(line => println("code:"+line(0)+";crdate:"+line(1)))
code:116695FZ;crdate:201705
code:116625FZ;crdate:201804
code:115914FZ;crdate:201711
code:115881CQ;crdate:201803
code:113954BJ;crdate:201709
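Reading Row fields by position, as above, breaks silently if the column order ever changes. Fields can also be read by name with Row.getAs[T]; the sketch below assumes the two columns are actually named code and crdate in the Hive schema and are stored as strings:

hitetst.take(5).foreach { row =>
  // Access columns by name rather than index; the column names are assumptions.
  println("code:" + row.getAs[String]("code") + ";crdate:" + row.getAs[String]("crdate"))
}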