Machine Learning: A Simple Implementation of a Naive Bayes Classifier

In this post I discuss a simple implementation of a Naive Bayes classifier to predict whether a patient has diabetes. For this I use the machine learning dataset "Pima Indians Diabetes Database" (https://www.kaggle.com/uciml/pima-indians-diabetes-database).

As always, we start by loading all the Python libraries we will use:

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('classic')
import numpy as np
import pandas as pd
import random
import math
from IPython.display import display
pi = math.pi

Exploratory Analysis

We start by loading the dataset file and inspecting the data: the features used to make the predictions, the size of the catalogue, and whether any data are missing.

full_catalog = pd.read_csv('/home/ealmaraz/dscience/sandbox/datasets/unzip/diabetes.csv')
print(full_catalog.columns)
print("Size of the catalogue: {}".format(len(full_catalog)))
print("Is there any NaN?: {}".format(full_catalog.isnull().any().any()))
full_catalog.head()
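
Beyond these checks, it can also help to glance at the summary statistics of each column; a quick optional step, not part of the classifier itself:

#optional: per-column summary statistics (count, mean, std, min/max, quartiles)
display(full_catalog.describe())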

positive = full_catalog[full_catalog['Outcome'] == 1]
print('number of patients with diabetes: ',len(positive))
negative = full_catalog[full_catalog['Outcome'] == 0]
print('number of healthy patients: ',len(negative))

number of patients with diabetes: 268

number of healthy patients: 500
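Note the class imbalance: a trivial classifier that always predicts "healthy" is already right in 500 out of 768 cases, i.e. about 65% of the time. This is a useful baseline to keep in mind when judging the success rates reported below; as a one-line check:

#baseline: accuracy of always predicting the majority class (Outcome == 0)
print('majority-class baseline: {:.3f}'.format(len(negative)/len(full_catalog)))
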

Let's see whether some of the features can help explain the data.

a) Blue represents healthy patients, red represents diabetic ones

#according to the color map, blue -> 0 (negative) & red -> 1 (positive)
df = pd.DataFrame(full_catalog, columns=full_catalog.columns.drop('Outcome'))
pd.plotting.scatter_matrix(df, c=full_catalog['Outcome'].values, figsize=(15, 15), marker='o',
                           hist_kwds={'bins': 10, 'color': 'green'}, s=10, alpha=.2, cmap=plt.get_cmap('bwr'));
[Figure: scatter matrix of all the features, color-coded by outcome (blue: healthy, red: diabetic)]

b) Diabetic patients

df = pd.DataFrame(positive, columns=positive.columns.drop('Outcome'))
pd.plotting.scatter_matrix(df, c='red', figsize=(15, 15), marker='o',
                           hist_kwds={'bins': 10, 'color': 'red'}, s=10, alpha=.2);
[Figure: scatter matrix of the features for the diabetic patients (red)]

c) Healthy patients

df = pd.DataFrame(negative, columns=negative.columns.drop('Outcome'))
pd.plotting.scatter_matrix(df, c='blue', figsize=(15, 15), marker='o',
                           hist_kwds={'bins': 10, 'color': 'blue'}, s=10, alpha=.2);
[Figure: scatter matrix of the features for the healthy patients (blue)]

Naive Bayes Classifier Functions

The create_training_test function splits the whole dataset into a training set and a test set. Parameters:

  • dataset: the dataset to analyze
  • fraction_training: the fraction of the dataset assigned to the training set (between 0 and 1)
  • msg: debugging flag. If True, the program prints information about what it is currently doing

Output:

  • training_set: dataframe containing the training set
  • test_set: dataframe containing the test set

def create_training_test(dataset, fraction_training, msg):

    size_dataset = len(dataset)
    size_training = round(size_dataset*fraction_training)
    size_test = size_dataset - size_training

    #initially, both the training and the test sets are copies of the whole dataset
    training_set = dataset.copy()
    test_set = dataset.copy()
    #index of the dataset dataframe
    total_idx_list = list(dataset.index.values)

    #index of the test set. We use random.sample to pick non-repeated integers from the dataset.index.values array
    test_idx_list = random.sample(list(dataset.index.values), size_test)
    test_idx_list.sort()

    #index of the training set. This is simply the difference between total_idx_list and test_idx_list
    training_idx_list = list(set(total_idx_list) - set(test_idx_list))

    #once we have the two lists, we drop the corresponding rows from the training and the test dataframes
    training_set.drop(training_set.index[test_idx_list], inplace=True)
    test_set.drop(test_set.index[training_idx_list], inplace=True)

    if msg == True:
        training_positive = training_set[training_set['Outcome']==1]
        training_negative = training_set[training_set['Outcome']==0]
        print("size of the dataset : {} samples".format(size_dataset))
        print('size of the training set : {} samples ({} of the whole dataset)'.format(len(training_set), fraction_training))
        print('\tpositive cases in the training set: {}'.format(len(training_positive)))
        print('\tnegative cases in the training set: {}'.format(len(training_negative)))
        print('size of the test set : {} samples'.format(len(test_set)))

    return training_set, test_set
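
As an aside, the same kind of random split can be obtained with scikit-learn's train_test_split; a minimal sketch, assuming scikit-learn is available (the imports above do not require it):

#equivalent random split using scikit-learn instead of the function above
from sklearn.model_selection import train_test_split
training_set, test_set = train_test_split(full_catalog, train_size=0.75)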

The get_parameters function creates a dictionary that stores the mean and the standard deviation of each feature. Its parameters are:

  • dataset: the dataframe to analyze
  • msg: debugging flag

Output:

  • dict_parameters: dictionary containing the mean and the standard deviation of each feature, e.g. {'Pregnancies': (3.02, 0.23), 'Age': (25.34, 3.2), ...}

def get_parameters(dataset, msg):
    features = dataset.columns.values
    nbins = 10
    dict_parameters = {}

    #we are excluding 'Outcome' from the loop
    for i in range(0, len(features)-1):
        #we single out the column features[i] from the dataset
        aux_df = pd.DataFrame(dataset[features[i]])
        #here we make the partition into nbins. aux_df gets an extra column indicating
        #which bin each instance belongs to
        aux_df['bin'] = pd.cut(aux_df[features[i]], nbins)
        #'counts' is a series whose index is the bin interval and whose values are the
        #number of counts in each bin
        counts = pd.value_counts(aux_df['bin'])

        points_X = np.zeros(nbins)
        points_Y = np.zeros(nbins)

        for j in range(0, nbins):
            points_X[j] = counts.index[j].mid   #the mid point of each bin
            points_Y[j] = counts.iloc[j]        #the number of counts
        total_Y = np.sum(points_Y)

        #we compute the mean and the standard deviation. The results are stored in the
        #dictionary dict_parameters, whose keys are the column labels and whose values are (mu, sigma)
        mu = np.sum(points_X*points_Y)/total_Y
        sigma2 = np.sum((points_X-mu)**2*points_Y)/(total_Y-1)
        sigma = math.sqrt(sigma2)
        dict_parameters[features[i]] = (mu, sigma)

        if msg == True:
            print('\t\tfeature: {}, mean: {}, standard deviation: {}'.format(features[i], mu, sigma))

    return dict_parameters
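
Since get_parameters estimates mu and sigma from the binned histogram rather than from the raw samples, the values are only approximations. A quick sanity-check sketch (not part of the classifier) compares them with the exact sample statistics computed by pandas:

#compare the binned estimates with the exact per-column statistics
params = get_parameters(positive, False)
for feature, (mu, sigma) in params.items():
    print('{}: mu {:.2f} (binned) vs {:.2f} (exact), sigma {:.2f} (binned) vs {:.2f} (exact)'
          .format(feature, mu, positive[feature].mean(), sigma, positive[feature].std()))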

The likelihood function computes the probability density of each feature for a given instance. Its parameters are:

  • instance: a pandas Series whose index holds the features and whose values are the corresponding measurements
  • dictionary: the dictionary of means and standard deviations used to evaluate the probability density of each feature of the instance

Output:

  • dict_likelihood: dictionary with the conditional probability density P(x_i | outcome) for each feature

Based on the exploratory analysis, we use an exponential distribution for Pregnancies, Insulin, DiabetesPedigreeFunction, and Age:

$$P(x) = \frac{1}{\mu}\, e^{-x/\mu}$$

For Glucose, BloodPressure, SkinThickness, and BMI we adopt a Gaussian distribution:

$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/(2\sigma^2)}$$

Strictly speaking, these are probability density functions, not probabilities. To obtain a probability we must multiply P(x) by dx. Indeed, P(x) itself may be greater than 1, but once multiplied by dx the result is always at most 1. This subtlety does not matter here, however: the dx factors are the same for outcome = 1 and outcome = 0, so they enter Bayes' theorem as identical multiplicative factors and have no effect when deciding whether P(outcome = 1 | features) is greater than P(outcome = 0 | features).
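
Written out explicitly, with x_i denoting the measurement of the i-th feature, the comparison the classifier performs is

$$\frac{P(1\mid \text{features})}{P(0\mid \text{features})} = \frac{P(1)\prod_i P(x_i\mid 1)\,dx_i}{P(0)\prod_i P(x_i\mid 0)\,dx_i} = \frac{P(1)\prod_i P(x_i\mid 1)}{P(0)\prod_i P(x_i\mid 0)},$$

so the dx_i factors cancel between numerator and denominator (the evidence P(features) cancels for the same reason, which is why it never appears in the code).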

def likelihood(instance, dictionary):
    instance = instance[instance.index != 'Outcome']
    dict_likelihood = {}
    for feature in instance.index:
        mu = dictionary[feature][0]
        sigma = dictionary[feature][1]
        measurement = instance[feature]
        if feature in ['Pregnancies','Insulin','DiabetesPedigreeFunction','Age']:
            #exponential distribution
            dict_likelihood[feature] = 1./mu*math.exp(-measurement/mu)
        elif feature in ['Glucose','BloodPressure','SkinThickness','BMI']:
            #Gaussian distribution
            dict_likelihood[feature] = 1./(math.sqrt(2.*pi)*sigma)*math.exp(-(measurement-mu)**2/(2.*sigma**2))
    return dict_likelihood
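
As a quick sanity check of the function, here is a minimal usage sketch (assuming the dataset and the functions above have already been loaded; the variable names are illustrative only):

#evaluate the per-feature densities of one test instance against the positive-class parameters
training, test = create_training_test(full_catalog, 0.75, False)
params_pos = get_parameters(training[training['Outcome'] == 1], False)
sample = test.iloc[0]                 #a pandas Series, one row of the test set
print(likelihood(sample, params_pos)) #e.g. {'Pregnancies': 0.12, 'Glucose': 0.008, ...}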

The bayes function implements Bayes' theorem to classify an instance. Inputs:

  • lkh_positive: dictionary with the conditional probability density P(features | outcome = 1) for each feature
  • lkh_negative: dictionary with the conditional probability density P(features | outcome = 0) for each feature
  • prob_positive: the probability of finding a positive case (the fraction of diabetic patients in the training set)

The output is the classifier's prediction: 1 (the patient has diabetes) or 0 (the patient does not).

In log space the classifier predicts 1 whenever

$$\log P(1) + \sum_i \log P(x_i \mid 1) > \log P(0) + \sum_i \log P(x_i \mid 0)$$

and 0 otherwise.

def bayes(lkh_positive, lkh_negative, prob_positive):
    logPositive = 0
    logNegative = 0

    for feature in lkh_positive:
        logPositive += math.log(lkh_positive[feature])
        logNegative += math.log(lkh_negative[feature])

    logPositive += math.log(prob_positive)
    logNegative += math.log(1. - prob_positive)

    if logPositive > logNegative:
        return 1
    else:
        return 0

The Classifier Driver

def pima_indians_NBClassifier(training_fraction, msg):

    #we import the catalog
    dataset = pd.read_csv('/home/ealmaraz/dscience/sandbox/datasets/unzip/diabetes.csv')

    #here we create the training and the test sets
    training, test = create_training_test(dataset, training_fraction, msg)

    #we split the training set into positive (1) and negative (0) values of 'Outcome'
    training_positive = training[training['Outcome']==1]
    training_negative = training[training['Outcome']==0]
    prob_positive = len(training_positive)/len(training)

    #we get the parameters for the positive (negative) subsamples of the training set
    if msg == True:
        print('getting the parameters for the training set...')
        print('\tpositive cases subsample')
    param_positive = get_parameters(training_positive, msg)

    if msg == True:
        print('\tnegative cases subsample')
    param_negative = get_parameters(training_negative, msg)

    if msg == True:
        print('\tprobability of finding a positive case: {}'.format(prob_positive))
        print('analyzing the test set...')

    #here we compute the accuracy of the classifier by looping over the instances of the test set
    error_count = 0

    for idx in test.index.values:
        instance = test.loc[idx]
        likelihood_positive = likelihood(instance, param_positive)
        likelihood_negative = likelihood(instance, param_negative)
        prediction = bayes(likelihood_positive, likelihood_negative, prob_positive)
        answer = int(instance['Outcome'])

        if prediction != answer:
            error_count += 1
            if msg == True:
                print('\tclassification error!')

    error_rate = float(error_count)/len(test)

    if msg == True:
        print('Results for this implementation:')
        print('\terror rate : {}'.format(error_rate))
        print('\tsuccessful classification rate : {}'.format(1. - error_rate))

    return error_rate
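
One practical note: because create_training_test relies on random.sample, every call produces a different split and hence a slightly different error rate. For a reproducible run the random generator can be seeded beforehand (an optional tweak):

#optional: fix the seed so that the train/test split is reproducible
random.seed(0)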

Classifier Performance

a) Single realization. Here we show the results of running a single realization of the classifier:

training_fraction = 0.75; msg = True
pima_indians_NBClassifier(training_fraction,msg)

b) Multiple realizations. To estimate the accuracy of the classifier we run it many times and average over all the realizations.

training_fraction = 0.75; nrealizations = 500; msg = False
error_rate = np.zeros(nrealizations)
success_rate = np.zeros(nrealizations)

for i in range(0, nrealizations):
    aux = pima_indians_NBClassifier(training_fraction, msg)
    error_rate[i] = aux
    success_rate[i] = 1. - aux

print('Results after {} realizations and training the classifier with {} of the whole sample...'.format(nrealizations, training_fraction))
print('error rate mean: {}, std {}'.format(np.mean(error_rate), np.std(error_rate)))
print('successful rate mean: {}, std {}'.format(np.mean(success_rate), np.std(success_rate)))
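
The spread across realizations can also be visualized with a quick histogram; a minimal sketch using the matplotlib setup loaded at the top (the bin count is an arbitrary choice):

#distribution of the success rate over all realizations
plt.hist(success_rate, bins=20, color='green', alpha=0.7)
plt.xlabel('success rate')
plt.ylabel('number of realizations')
plt.show()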

