First, launch Spark's interactive Scala environment (spark-shell) on the Hadoop cluster.
Launch command: spark-shell [optional flag: --executor-memory 500m]
Method 1: start spark-shell directly on the local machine:
[root@yhfmaster ~]# spark-shell
Method 2: connect to a standalone Spark master running on another machine:
[root@yhfmaster ~]# spark-shell --master spark://192.168.1.100:7077 --executor-memory 500m
(A standalone master URL takes the form spark://host:port; 7077 is the default master port.)
1. Import the HiveContext class
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
2. Initialize a HiveContext from the existing SparkContext (sc)
scala> val hiveContext = new HiveContext(sc)
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@b6db92b
3. Switch to the yh_test database
scala> hiveContext.sql("use yh_test")
res0: org.apache.spark.sql.DataFrame = [result: string]
4. Query the logistic_lineback_regression table, convert the resulting DataFrame to an RDD, and assign it to the hitetst variable
scala> val hitetst = hiveContext.sql("select * from logistic_lineback_regression limit 10").rdd
hitetst: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[7] at rdd at <console>:34
5. Take 5 rows from hitetst and print the first two columns of each
scala> hitetst.take(5).foreach(line => println("code:"+line(0)+";crdate:"+line(1)))
code:116695FZ;crdate:201705
code:116625FZ;crdate:201804
code:115914FZ;crdate:201711
code:115881CQ;crdate:201803
code:113954BJ;crdate:201709
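The interactive steps above can also be collected into a standalone Scala application submitted with spark-submit. This is a minimal sketch, not a definitive implementation: the object name HiveQueryExample is made up here, and it assumes a Spark 1.x build with Hive support on the classpath; the database (yh_test) and table (logistic_lineback_regression) names are taken from the session above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveQueryExample {
  def main(args: Array[String]): Unit = {
    // spark-shell creates `sc` automatically; a standalone app builds it itself
    val conf = new SparkConf().setAppName("HiveQueryExample")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // Same steps as the shell session: pick the database, query, convert to RDD
    hiveContext.sql("use yh_test")
    val rows = hiveContext
      .sql("select * from logistic_lineback_regression limit 10")
      .rdd // DataFrame -> RDD[org.apache.spark.sql.Row]

    // Print the first two columns of the first five rows
    rows.take(5).foreach(row => println("code:" + row(0) + ";crdate:" + row(1)))

    sc.stop()
  }
}
```

Such a program would typically be packaged into a jar and run with spark-submit against the same master URL used above.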