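Today, we take a closer look at eight top Python machine learning algorithms and implement each of them.
Let's begin our tour of machine learning algorithms in Python.
8 Python Machine Learning Algorithms You Must Learn
The following are the Python machine learning algorithms: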
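1. Linear Regression
Linear regression is a supervised Python machine learning algorithm that observes continuous features and predicts an outcome. Depending on whether it runs on a single variable or on many features, we call it simple linear regression or multiple linear regression.
It is one of the most popular Python ML algorithms and is often under-appreciated. It assigns optimal weights to the variables to create a line a*X + b that predicts the output. We often use linear regression to estimate real values, such as a number of calls or the cost of a house, from continuous variables. The regression line is the best-fitting line Y = a*X + b and expresses the relationship between the independent variable X and the dependent variable Y.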
Let's plot this for the diabetes dataset.
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> from sklearn import datasets, linear_model
>>> from sklearn.metrics import mean_squared_error, r2_score
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X = diabetes.data[:, np.newaxis, 2]  # keep a single feature
>>> diabetes_X_train = diabetes_X[:-30]  # split the data into training and test sets
>>> diabetes_X_test = diabetes_X[-30:]
>>> diabetes_y_train = diabetes.target[:-30]  # split the targets into training and test sets
>>> diabetes_y_test = diabetes.target[-30:]
>>> regr = linear_model.LinearRegression()  # linear regression object
>>> regr.fit(diabetes_X_train, diabetes_y_train)  # train the model on the training set
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> diabetes_y_pred = regr.predict(diabetes_X_test)  # make predictions
>>> regr.coef_
array([941.43097333])
>>> mean_squared_error(diabetes_y_test, diabetes_y_pred)
3035.0601152912695
>>> r2_score(diabetes_y_test, diabetes_y_pred)  # variance score
0.410920728135835
>>> plt.scatter(diabetes_X_test, diabetes_y_test, color='lavender')
>>> plt.plot(diabetes_X_test, diabetes_y_pred, color='pink', linewidth=3)
>>> plt.xticks(())
>>> plt.yticks(())
>>> plt.show()
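To make the fitted line Y = a*X + b concrete, here is a minimal from-scratch sketch of simple least squares. The function name `fit_line` and the toy data are assumptions made for illustration; this is not how scikit-learn implements `LinearRegression` internally.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing the squared error of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # The least-squares line always passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


# Hypothetical toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
print(a, b)  # 2.0 1.0
```

With a single feature, the slope `a` plays the role of `regr.coef_` and `b` the role of `regr.intercept_` in the session above.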
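2. Logistic Regression
Logistic regression is a supervised classification algorithm that estimates discrete values such as 0/1, yes/no, and true/false from a given set of independent variables. It uses a logistic function to predict the probability of an event, which gives an output between 0 and 1.
Although the name says 'regression', this is actually a classification algorithm. Logistic regression fits the data to a logit function and is also called logit regression. Let's plot it.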
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from sklearn import linear_model
>>> xmin, xmax = -7, 7  # test set; a straight line with Gaussian noise
>>> n_samples = 77
>>> np.random.seed(0)
>>> x = np.random.normal(size=n_samples)
>>> y = (x > 0).astype(float)  # np.float was removed from recent NumPy releases
>>> x[x > 0] *= 3
>>> x += .4 * np.random.normal(size=n_samples)
>>> x = x[:, np.newaxis]
>>> clf = linear_model.LogisticRegression(C=1e4)  # classifier
>>> clf.fit(x, y)
>>> plt.figure(1, figsize=(3, 4))
>>> plt.clf()
>>> plt.scatter(x.ravel(), y, color='lavender', zorder=17)
>>> x_test = np.linspace(-7, 7, 277)
>>> def model(x):
...     return 1 / (1 + np.exp(-x))
...
>>> loss = model(x_test * clf.coef_ + clf.intercept_).ravel()
>>> plt.plot(x_test, loss, color='pink', linewidth=2.5)
>>> ols = linear_model.LinearRegression()
>>> ols.fit(x, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> plt.plot(x_test, ols.coef_ * x_test + ols.intercept_, linewidth=1)
>>> plt.axhline(.4, color='0.4')
>>> plt.ylabel('y')
Text(0, 0.5, 'y')
>>> plt.xlabel('x')
Text(0.5, 0, 'x')
>>> plt.xticks(range(-7, 7))
>>> plt.yticks([0, 0.4, 1])
>>> plt.ylim(-.25, 1.25)
(-0.25, 1.25)
>>> plt.xlim(-4, 10)
(-4, 10)
>>> plt.legend(('Logistic Regression', 'Linear Regression'), loc='lower right', fontsize='small')
>>> plt.show()
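The pink curve in the plot is the logistic function applied to a linear score. As a rough sketch of how that probability becomes a 0/1 label, the snippet below uses made-up values for the coefficient and intercept; the helper names and the 0.5 threshold are assumptions for illustration only.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, coef, intercept):
    """Probability of class 1 for a single feature value x."""
    return sigmoid(coef * x + intercept)


def predict_class(x, coef, intercept, threshold=0.5):
    """Turn the probability into a discrete 0/1 label."""
    return 1 if predict_proba(x, coef, intercept) >= threshold else 0


# Hypothetical fitted parameters, not taken from the session above.
coef, intercept = 2.0, -1.0
for x in (-3.0, 0.5, 3.0):
    print(x, predict_proba(x, coef, intercept), predict_class(x, coef, intercept))
```

The decision boundary sits where the score `coef * x + intercept` equals zero, which is x = 0.5 for these assumed parameters.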
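3. Decision Tree
Decision trees belong to supervised learning and serve both classification and regression, though they are mostly used for classification. The model takes an instance, traverses the tree, and compares important features against a conditional statement at each node. Whether it descends to the left or the right child branch depends on the result. Usually, the more important features sit closer to the root.
This Python machine learning algorithm works on both categorical and continuous dependent variables. Here, we split the population into two or more homogeneous sets. Let's see the algorithm in action: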
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.metrics import accuracy_score, confusion_matrix
>>> from sklearn.metrics import classification_report
>>> def importdata():  # import the data
...     balance_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-' +
...                                'databases/balance-scale/balance-scale.data',
...                                sep=',', header=None)
...     print(len(balance_data))
...     print(balance_data.shape)
...     print(balance_data.head())
...     return balance_data
...
>>> def splitdataset(balance_data):  # split the data
...     x = balance_data.values[:, 1:5]
...     y = balance_data.values[:, 0]
...     x_train, x_test, y_train, y_test = train_test_split(
...         x, y, test_size=0.3, random_state=100)
...     return x, y, x_train, x_test, y_train, y_test
...
>>> def train_using_gini(x_train, x_test, y_train):  # train with the Gini index
...     clf_gini = DecisionTreeClassifier(criterion="gini",
...                                       random_state=100, max_depth=3, min_samples_leaf=5)
...     clf_gini.fit(x_train, y_train)
...     return clf_gini
...
>>> def train_using_entropy(x_train, x_test, y_train):  # train with entropy
...     clf_entropy = DecisionTreeClassifier(
...         criterion="entropy", random_state=100,
...         max_depth=3, min_samples_leaf=5)
...     clf_entropy.fit(x_train, y_train)
...     return clf_entropy
...
>>> def prediction(x_test, clf_object):  # make predictions
...     y_pred = clf_object.predict(x_test)
...     print(f"Predicted values: {y_pred}")
...     return y_pred
...
>>> def cal_accuracy(y_test, y_pred):  # calculate accuracy
...     print(confusion_matrix(y_test, y_pred))
...     print(accuracy_score(y_test, y_pred) * 100)
...     print(classification_report(y_test, y_pred))
...
>>> data = importdata()
625
(625, 5)
   0  1  2  3  4
0  B  1  1  1  1
1  R  1  1  1  2
2  R  1  1  1  3
3  R  1  1  1  4
4  R  1  1  1  5
>>> x, y, x_train, x_test, y_train, y_test = splitdataset(data)
>>> clf_gini = train_using_gini(x_train, x_test, y_train)
>>> clf_entropy = train_using_entropy(x_train, x_test, y_train)
>>> y_pred_gini = prediction(x_test, clf_gini)
>>> cal_accuracy(y_test, y_pred_gini)
[[ 0  6  7]
 [ 0 67 18]
 [ 0 19 71]]
73.40425531914893
>>> y_pred_entropy = prediction(x_test, clf_entropy)
>>> cal_accuracy(y_test, y_pred_entropy)
[[ 0  6  7]
 [ 0 63 22]
 [ 0 20 70]]
70.74468085106383
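The two `criterion` values used above, "gini" and "entropy", are both measures of how mixed the class labels in a node are. The following is a small sketch of the two formulas on hand-made label lists; the helper names and the example labels are assumptions for illustration, not part of scikit-learn's API.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: minus the sum of p * log2(p) over classes."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


# Hypothetical node contents using the balance-scale class names.
mixed_node = ['L', 'L', 'R', 'R']   # evenly mixed, most impure for two classes
pure_node = ['L', 'L', 'L', 'L']    # a single class, perfectly pure
print(gini(mixed_node), entropy(mixed_node))  # 0.5 1.0
print(gini(pure_node))                        # 0.0
```

A tree learner prefers the split whose child nodes have the lowest weighted impurity, so both measures reward splits that produce homogeneous sets.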
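4. Support Vector Machine (SVM)
SVM is a supervised classification algorithm that draws a line dividing the different categories of data. In this ML algorithm, we compute the vector that optimizes the line, so that the closest points of each group lie as far from each other as possible. The boundary is almost always linear, but it does not have to be.
In this Python machine learning tutorial, we plot each data item as a point in n-dimensional space, where n is the number of features and the value of each feature is one coordinate.
First, let's plot a dataset.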
>>> from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed
>>> x, y = make_blobs(n_samples=500, centers=2,
...                   random_state=0, cluster_std=0.40)
>>> import matplotlib.pyplot as plt
>>> plt.scatter(x[:, 0], x[:, 1], c=y, s=50, cmap='plasma')
<matplotlib.collections.PathCollection object at 0x04E1BBF0>
>>> plt.show()
>>> import numpy as np
>>> xfit = np.linspace(-1, 3.5)
>>> plt.scatter(x[:, 0], x[:, 1], c=y, s=50, cmap='plasma')
>>> for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
...     yfit = m * xfit + b
...     plt.plot(xfit, yfit, '-k')  # candidate separating line
...     plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
...                      color='#AFFEDC', alpha=0.4)  # margin around the line
...
>>> plt.xlim(-1, 3.5)
(-1, 3.5)
>>> plt.show()
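The shaded bands in the plot are margins around each candidate line, and an SVM prefers the line with the widest margin. The sketch below scores two hand-picked candidate lines by their margin; it only illustrates the idea and is not a real SVM solver. The function name, the points, and the candidate lines are all assumptions for illustration.

```python
import math


def margin(w, b, points, labels):
    """Smallest signed distance from any point to the line w.x + b = 0.

    Labels are +1 or -1; the result is positive only when every point
    lies on its own side of the line.
    """
    norm = math.hypot(w[0], w[1])
    return min(label * (w[0] * px + w[1] * py + b) / norm
               for (px, py), label in zip(points, labels))


# Hypothetical, linearly separable toy data.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
labels = [-1, -1, 1, 1]

# Two hypothetical vertical separating lines, stored as (w, b).
candidates = {
    "x = 1.5": ((1.0, 0.0), -1.5),
    "x = 1.0": ((1.0, 0.0), -1.0),
}
best = max(candidates, key=lambda name: margin(*candidates[name], points, labels))
print(best)  # x = 1.5
```

The line halfway between the two groups wins because its nearest points on either side are equally far away, which is the property the prose above describes.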
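5. Naive Bayes
Naive Bayes is a classification method based on Bayes' theorem. It assumes independence between predictors: a Naive Bayes classifier treats each feature of a class as unrelated to every other feature. Consider a fruit. It is an apple if it is round, red, and 2.5 inches in diameter. A Naive Bayes classifier says that each of these features contributes independently to the probability that the fruit is an apple, even if the features actually depend on each other.
A Naive Bayes model is easy to build, even for very large datasets. Such a model is not only very simple but can also hold its own against many far more sophisticated classification methods. Let's build one.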
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn import datasets
>>> from sklearn.metrics import confusion_matrix
>>> from sklearn.model_selection import train_test_split
>>> iris = datasets.load_iris()
>>> x = iris.data
>>> y = iris.target
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
>>> gnb = GaussianNB()
>>> mnb = MultinomialNB()
>>> y_pred_gnb = gnb.fit(x_train, y_train).predict(x_test)
>>> cnf_matrix_gnb = confusion_matrix(y_test, y_pred_gnb)
>>> cnf_matrix_gnb
array([[16,  0,  0],
       [ 0, 18,  0],
       [ 0,  0, 11]], dtype=int64)
>>> y_pred_mnb = mnb.fit(x_train, y_train).predict(x_test)
>>> cnf_matrix_mnb = confusion_matrix(y_test, y_pred_mnb)
>>> cnf_matrix_mnb
array([[16,  0,  0],
       [ 0,  0, 18],
       [ 0,  0, 11]], dtype=int64)
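The independence assumption means the score for a class is simply its prior multiplied by one likelihood per observed feature. Here is a minimal sketch of that product for the fruit example; every probability in it is invented for illustration, and the function name is an assumption rather than a scikit-learn API.

```python
def naive_bayes_posteriors(priors, likelihoods, observed):
    """Return normalized class probabilities under the independence assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        # Each feature contributes its own factor, independently of the others.
        for feature in observed:
            score *= likelihoods[cls][feature]
        scores[cls] = score
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}


# Hypothetical priors P(class) and likelihoods P(feature | class).
priors = {"apple": 0.5, "orange": 0.5}
likelihoods = {
    "apple": {"round": 0.9, "red": 0.8},
    "orange": {"round": 0.9, "red": 0.1},
}
posteriors = naive_bayes_posteriors(priors, likelihoods, ["round", "red"])
print(posteriors)
```

Being round does not separate the two fruits here, so the posterior is driven almost entirely by the color likelihoods.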
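6. kNN (k-Nearest Neighbors)
This is a Python machine learning algorithm for classification and regression, used mostly for classification. It is a supervised learning algorithm that keeps the training points and compares distances with a distance function, usually the Euclidean one. It classifies a new case by a majority vote of its k nearest neighbors: the case is assigned to the class that is most common among them.
I. Training and testing on the entire dataset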
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> x = iris.data
>>> y = iris.target
>>> from sklearn.linear_model import LogisticRegression
>>> logreg = LogisticRegression()
>>> logreg.fit(x, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
>>> logreg.predict(x)
array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
>>> y_pred = logreg.predict(x)
>>> len(y_pred)
150
>>> from sklearn import metrics
>>> metrics.accuracy_score(y, y_pred)
0.96
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(x, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
>>> y_pred = knn.predict(x)
>>> metrics.accuracy_score(y, y_pred)
0.9666666666666667
>>> knn = KNeighborsClassifier(n_neighbors=1)
>>> knn.fit(x, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
>>> y_pred = knn.predict(x)
>>> metrics.accuracy_score(y, y_pred)
1.0
II. Splitting into train/test
>>> x.shape
(150, 4)
>>> y.shape
(150,)
>>> from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=4)
>>> x_train.shape
(90, 4)
>>> x_test.shape
(60, 4)
>>> y_train.shape
(90,)
>>> y_test.shape
(60,)
>>> logreg = LogisticRegression()
>>> logreg.fit(x_train, y_train)
>>> y_pred = logreg.predict(x_test)  # predict with the model that was just fitted
>>> metrics.accuracy_score(y_test, y_pred)
0.9666666666666667
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(x_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
>>> y_pred = knn.predict(x_test)
>>> metrics.accuracy_score(y_test, y_pred)
0.9666666666666667
>>> k_range = range(1, 26)
>>> scores = []
>>> for k in k_range:
...     knn = KNeighborsClassifier(n_neighbors=k)
...     knn.fit(x_train, y_train)
...     y_pred = knn.predict(x_test)
...     scores.append(metrics.accuracy_score(y_test, y_pred))
...
>>> scores
[0.95, 0.95, 0.9666666666666667, 0.9666666666666667, 0.9666666666666667,
 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333,
 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333,
 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333,
 0.9666666666666667, 0.9833333333333333, 0.9666666666666667, 0.9666666666666667,
 0.9666666666666667, 0.9666666666666667, 0.95, 0.95]
>>> import matplotlib.pyplot as plt
>>> plt.plot(k_range, scores)
>>> plt.xlabel('k for kNN')
Text(0.5, 0, 'k for kNN')
>>> plt.ylabel('Testing accuracy')
Text(0, 0.5, 'Testing accuracy')
>>> plt.show()
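The majority vote that kNN performs can be written in a few lines. This is a brute-force sketch that sorts all training points by Euclidean distance to the query. The function name and the toy points are assumptions for illustration, and scikit-learn's `KNeighborsClassifier` uses more efficient neighbor searches than this.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify query by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-class toy data.
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_labels = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict(train_points, train_labels, (2, 2), k=3))  # a
print(knn_predict(train_points, train_labels, (7, 7), k=3))  # b
```

Choosing k is the trade-off explored in the accuracy plot above: a very small k follows noise in the training set, while a very large k blurs the class boundaries.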
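7. K-Means
k-Means is an unsupervised algorithm that solves clustering problems. It partitions the data into a chosen number of clusters. Data points within a cluster are homogeneous with each other and heterogeneous with respect to the points in other clusters.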
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from matplotlib import style
>>> style.use('ggplot')
>>> from sklearn.cluster import KMeans
>>> x = [1, 5, 1.5, 8, 1, 9]
>>> y = [2, 8, 1.7, 6, 0.2, 12]
>>> plt.scatter(x, y)
>>> x = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])
>>> kmeans = KMeans(n_clusters=2)
>>> kmeans.fit(x)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
>>> centroids = kmeans.cluster_centers_
>>> labels = kmeans.labels_
>>> centroids
array([[1.16666667, 1.46666667],
       [7.33333333, 9.        ]])
>>> labels
array([0, 1, 0, 1, 0, 1])
>>> colors = ['g.', 'r.', 'c.', 'y.']
>>> for i in range(len(x)):
...     print(x[i], labels[i])
...     plt.plot(x[i][0], x[i][1], colors[labels[i]], markersize=10)
...
[1. 2.] 0
[5. 8.] 1
[1.5 1.8] 0
[8. 8.] 1
[1.  0.6] 0
[ 9. 11.] 1
>>> plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=150, linewidths=5, zorder=10)
>>> plt.show()
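To show what happens inside the fit, here is a rough sketch of the classic two-step k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points) on the same six points. Starting from the first two points as centroids is an assumption made for reproducibility; scikit-learn's `KMeans` uses k-means++ initialization with several restarts instead.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain k-means loop; returns the final centroids and point labels."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, labels


points = [(1.0, 2.0), (5.0, 8.0), (1.5, 1.8), (8.0, 8.0), (1.0, 0.6), (9.0, 11.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[1]])
print(centroids)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

With this starting point the loop settles on the same two groups as the session above, with centroids near (1.17, 1.47) and (7.33, 9.0).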
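8. Random Forest
A Random Forest is an ensemble of decision trees. To classify a new object by its attributes, every tree gives a classification, that is, it votes for a class. The classification with the most votes wins in the forest.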
>>> import numpy as np
>>> import pylab as pl
>>> x = np.random.uniform(1, 100, 1000)
>>> y = np.log(x) + np.random.normal(0, .3, 1000)
>>> pl.scatter(x, y, s=1, label='log(x) with noise')
>>> pl.plot(np.arange(1, 100), np.log(np.arange(1, 100)), c='b', label='log(x) true function')
>>> pl.xlabel('x')
Text(0.5, 0, 'x')
>>> pl.ylabel('f(x) = log(x)')
Text(0, 0.5, 'f(x) = log(x)')
>>> pl.legend(loc='best')
>>> pl.title('A basic log function')
Text(0.5, 1, 'A basic log function')
>>> pl.show()
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> import pandas as pd
>>> import numpy as np
>>> iris = load_iris()
>>> df = pd.DataFrame(iris.data, columns=iris.feature_names)
>>> df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75  # roughly 75% of rows for training
>>> df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
>>> df.head()
   sepal length (cm)  sepal width (cm)  ...  is_train  species
0                5.1               3.5  ...      True   setosa
1                4.9               3.0  ...      True   setosa
2                4.7               3.2  ...      True   setosa
3                4.6               3.1  ...      True   setosa
4                5.0               3.6  ...     False   setosa
[5 rows x 6 columns]
>>> train, test = df[df['is_train'] == True], df[df['is_train'] == False]
>>> features = df.columns[:4]
>>> clf = RandomForestClassifier(n_jobs=2)
>>> y, _ = pd.factorize(train['species'])
>>> clf.fit(train[features], y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
>>> preds = iris.target_names[clf.predict(test[features])]
>>> pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])
preds       setosa  versicolor  virginica
actual
setosa          12           0          0
versicolor       0          17           2
virginica        0           1          15
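The voting step described at the start of this section can be sketched without any training at all. In the snippet below, three hand-written rules stand in for trained trees; the rules, thresholds, and feature names are invented for illustration and are not what `RandomForestClassifier` learns.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Each tree votes for a class; the most common vote wins."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical one-rule "trees" standing in for trained decision trees.
trees = [
    lambda s: "setosa" if s["petal_length"] < 2.5 else "versicolor",
    lambda s: "setosa" if s["petal_width"] < 0.8 else "virginica",
    lambda s: "versicolor" if s["sepal_length"] > 5.5 else "setosa",
]

sample = {"petal_length": 1.4, "petal_width": 0.2, "sepal_length": 5.8}
print(forest_predict(trees, sample))  # setosa
```

The third rule votes for the wrong class here, yet the forest still answers correctly, which is the point of combining many imperfect trees.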
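So, that wraps up this tutorial on Python machine learning algorithms. We hope you enjoyed it.
Today we discussed eight important Python machine learning algorithms. Which one do you think has the most potential?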