Solving Multi-Class Text Classification Problems with scikit-learn (with a Python Walkthrough)

Source | TowardsDataScience

Translator | Revolver

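In the business world there are many situations where text needs to be classified. For example, news stories are typically organized by topic; content or products are often tagged by category; and users can be grouped into cohorts based on how they talk about a product or brand online.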

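However, the vast majority of text classification articles and tutorials on the internet cover binary text classification, such as spam filtering (spam vs. legitimate mail) or sentiment analysis (positive vs. negative). In most cases, real-world problems are considerably more complicated. So that is what we will do today: classify consumer finance complaints into 12 predefined categories. The data can be downloaded from data.gov (https://catalog.data.gov/dataset/consumer-complaint-database).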

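We use Python and Jupyter Notebook to develop our system, relying on the machine learning components in Scikit-Learn. If you would like to see an implementation in PySpark (https://medium.com/@actsusanli/multi-class-text-classification-with-pyspark-7d78d022ed35), read the next article.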

I. Problem Formulation

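Our problem is a supervised text classification problem, and our goal is to investigate which supervised machine learning methods are best suited to solve it.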

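When a new complaint comes in, we want to assign it to one of 12 categories. The classifier assumes that each new complaint belongs to one and only one category. This is a multi-class text classification problem. We can't wait to see how far we can get!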

II. Data Exploration

Before diving into machine learning models, we should first look at some of the data and see what the complaints in each category look like.

import pandas as pd
df = pd.read_csv('Consumer_Complaints.csv')
df.head()




For this project we need only two columns: "Product" and "Consumer complaint narrative".

Input: Consumer_complaint_narrative

Example: "I have outdated information on my credit report that I have previously disputed that has yet to be removed this information is more then seven years old and does not meet credit reporting requirements"

Output: Product

Example: Credit reporting

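We will remove the records with missing values in the "Consumer_complaint_narrative" column and add a column that encodes Product as an integer, because categorical labels are generally easier to work with as integers than as strings.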

We also create a couple of dictionaries that map between the integer labels and the Product names, for later use.

After cleaning up, these are the first five rows of the data we will be working with:

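col = ['Product', 'Consumer complaint narrative']

# Keep only the two columns we need
df = df[col]
# Drop complaints without narrative text; copy so the new column below is added to an independent frame
df = df[pd.notnull(df['Consumer complaint narrative'])].copy()
df.columns = ['Product', 'Consumer_complaint_narrative']
# Encode each Product as an integer id, in order of first appearance
df['category_id'] = df['Product'].factorize()[0]
category_id_df = df[['Product', 'category_id']].drop_duplicates().sort_values('category_id')
# Lookup tables in both directions: Product -> id and id -> Product
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Product']].values)
df.head()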
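To make the integer encoding concrete, here is a minimal sketch on a made-up four-row Series (the values and the variable names `products`, `codes`, and `uniques` are ours, not from the dataset). It illustrates that `factorize` numbers the categories in order of first appearance, which is what the two lookup dictionaries rely on.

```python
import pandas as pd

# Hypothetical toy labels standing in for the Product column
products = pd.Series(["Mortgage", "Debt collection", "Mortgage", "Credit card"])

# factorize returns integer codes plus the distinct values in order of first appearance
codes, uniques = products.factorize()

# Lookup tables in both directions, analogous to category_to_id and id_to_category
category_to_id = {name: i for i, name in enumerate(uniques)}
id_to_category = dict(enumerate(uniques))

print(codes.tolist())   # integer label for every row
print(list(uniques))    # category name for every integer id
```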




III. Imbalanced Classes

We see that the number of complaints per product is imbalanced. Consumer complaints are skewed toward Debt collection, Credit reporting, and Mortgage.

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,6))
df.groupby('Product').Consumer_complaint_narrative.count().plot.bar(ylim=0)
plt.show()




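When we encounter such problems, standard algorithms run into difficulties. Conventional algorithms often do not take the data distribution into account and are therefore biased toward the majority classes. In the worst case, minority classes are treated as outliers and ignored. For some scenarios, such as fraud detection or cancer prediction, we would need to configure our model carefully or artificially rebalance the dataset, for example by undersampling or oversampling each class.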

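In our example, however, the majority classes may be exactly the ones we are most interested in. We want a classifier that gives high prediction accuracy on the majority classes while maintaining reasonable accuracy on the classes with fewer samples. So we will leave the class proportions as they are.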
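For reference, the rebalancing that we choose not to apply here could look roughly like the sketch below. It is a minimal random-oversampling example on made-up data; the helper `oversample_to_majority` and the toy frame are hypothetical and not part of the original notebook. In practice, resampling of this kind would normally be applied to the training split only.

```python
import pandas as pd

# Hypothetical toy data standing in for the complaints DataFrame
toy = pd.DataFrame({
    "Product": ["Debt collection"] * 6 + ["Mortgage"] * 3 + ["Payday loan"] * 1,
    "text": [f"complaint {i}" for i in range(10)],
})

def oversample_to_majority(frame, label_col, random_state=0):
    """Resample every class with replacement up to the size of the largest class."""
    target = frame[label_col].value_counts().max()
    parts = [
        group.sample(n=target, replace=True, random_state=random_state)
        for _, group in frame.groupby(label_col)
    ]
    return pd.concat(parts, ignore_index=True)

balanced = oversample_to_majority(toy, "Product")
print(toy["Product"].value_counts().to_dict())       # imbalanced counts
print(balanced["Product"].value_counts().to_dict())  # every class at the majority size
```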

IV. Text Representation

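Classifiers and learning algorithms cannot process text in its original form, because they expect fixed-length numerical feature vectors rather than raw text of variable length. Therefore, during preprocessing, the text is converted into a more manageable representation.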

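A common approach for extracting features from text is the bag-of-words model: for each document, in our case a Consumer_complaint_narrative, the model considers how often each word occurs but ignores the order in which the words appear.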
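A small sketch of what "ignores the order" means in practice, using scikit-learn's `CountVectorizer` on two made-up sentences (the sentences and variable names are ours). Both sentences contain the same words in a different order, so they end up with identical count vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents with the same words in a different order
docs = [
    "the bank charged the fee",
    "the fee charged the bank",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs).toarray()

# Column order follows the alphabetically sorted vocabulary
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(counts)

# Word order is lost: both rows are identical
same_vector = bool((counts[0] == counts[1]).all())
print(same_vector)
```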

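Specifically, for each term in our dataset we will calculate its term frequency and inverse document frequency, abbreviated tf-idf. We will use sklearn.feature_extraction.text.TfidfVectorizer to compute a tf-idf vector for each consumer complaint narrative: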

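(1) sublinear_tf is set to True to use a logarithmic form of the term frequency.

(2) min_df is the minimum number of documents a term must appear in to be kept.

(3) norm is set to l2 so that all our feature vectors have a Euclidean norm of 1.

(4) ngram_range is set to (1, 2) to indicate that we want to consider both unigrams and bigrams.

(5) stop_words is set to "english" to remove common function words ("a", "the", ...) and reduce the number of noisy features.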

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()
labels = df.category_id
features.shape
(4569,12633)

Now each of the 4569 consumer complaint narratives is represented by 12633 features, which are the tf-idf scores of different unigrams and bigrams.
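To see what some of these settings do, here is a minimal sketch on three made-up sentences (the corpus and the name `toy_tfidf` are ours; `min_df` is left at its default because the toy corpus is too small for a threshold of 5). With `sublinear_tf=True`, scikit-learn replaces a raw term count tf by 1 + ln(tf); with `norm='l2'`, every non-empty row is scaled to unit Euclidean length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
docs = [
    "overdraft fees on my checking account",
    "credit report shows an old debt",
    "debt collection agency keeps calling",
]

toy_tfidf = TfidfVectorizer(sublinear_tf=True, norm="l2",
                            ngram_range=(1, 2), stop_words="english")
X = toy_tfidf.fit_transform(docs).toarray()

# Each document vector has Euclidean norm 1 because of norm="l2"
row_norms = np.linalg.norm(X, axis=1)
print(row_norms)

# The vocabulary mixes unigrams and bigrams; stop words such as "my" are dropped
terms = sorted(toy_tfidf.vocabulary_)
print(terms)
```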

We can use sklearn.feature_selection.chi2 to find the terms that are most correlated with each category (Product):

from sklearn.feature_selection import chi2
import numpy as np
N = 2
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
all_feature_names = np.array(tfidf.get_feature_names_out())
for Product, category_id in sorted(category_to_id.items()):
    # Chi-squared score of every feature against "belongs to this category or not"
    features_chi2 = chi2(features, labels == category_id)
    # Ascending sort, so the highest-scoring terms come last
    indices = np.argsort(features_chi2[0])
    feature_names = all_feature_names[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(Product))
    print(" . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
    print(" . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))
# 'Bank account or service':
. Most correlated unigrams:
. bank
. overdraft
. Most correlated bigrams:
. overdraft fees
. checking account
# 'Consumer Loan':
. Most correlated unigrams:
. car
. vehicle
. Most correlated bigrams:
. vehicle xxxx
. toyota financial
# 'Credit card':
. Most correlated unigrams:
. citi
. card
. Most correlated bigrams:
. annual fee
. credit card
# 'Credit reporting':
. Most correlated unigrams:
. experian
. equifax
. Most correlated bigrams:
. trans union
. credit report
# 'Debt collection':
. Most correlated unigrams:
. collection
. debt
. Most correlated bigrams:
. collect debt
. collection agency
# 'Money transfers':
. Most correlated unigrams:
. wu
. paypal
. Most correlated bigrams:
. western union
. money transfer
# 'Mortgage':
. Most correlated unigrams:
. modification
. mortgage
. Most correlated bigrams:
. mortgage company
. loan modification


The terms listed above seem to make sense for their categories, don't they?
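The chi-squared ranking can be reproduced by hand on a tiny example. The sketch below uses four made-up documents and raw counts instead of tf-idf (both are our simplifications). Terms that occur in only one class receive the highest scores. Note that the score measures association in either direction, so a term that is typical of the other class also scores high.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Hypothetical toy corpus: two "bank" documents and two "mortgage" documents
docs = [
    "bank overdraft fee",
    "bank account overdraft",
    "mortgage loan modification",
    "mortgage payment loan",
]
y = np.array([0, 0, 1, 1])  # 0 = bank, 1 = mortgage

vec = CountVectorizer()
X = vec.fit_transform(docs)

# One-vs-rest target for the "mortgage" class, as in the loop over categories
scores, p_values = chi2(X, (y == 1).astype(int))

# Feature columns follow the alphabetically sorted vocabulary
terms = np.array(sorted(vec.vocabulary_))
ranked = terms[np.argsort(scores)[::-1]]
print(ranked)
```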

V. Multi-Class Classifier: Features and Design

1. To train supervised classifiers, we first transform "Consumer_complaint_narrative" into numerical vectors. We explored vector representations such as TF-IDF weighted vectors.

2. With this vector representation of the text, we can train supervised classifiers and predict the "Product" of previously unseen "Consumer_complaint_narrative" entries.

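After all the data transformations above, we have all the features and labels, so it is time to train classifiers. There are many algorithms we can use for this type of problem.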

3. Naive Bayes classifier: the variant best suited to word counts is multinomial Naive Bayes:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
X_train, X_test, y_train, y_test = train_test_split(df['Consumer_complaint_narrative'], df['Product'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)

After fitting on the training set, let's use the model to make some predictions.

# As in the source, new text is vectorized with count_vect only, while clf was fit on tf-idf weights
complaint_1 = "This company refuses to provide me verification and validation of debt per my right under the FDCPA. I do not believe this debt is mine."
print(clf.predict(count_vect.transform([complaint_1])))
['Debt collection']
# Look up the same complaint in the data to compare with its actual Product
df[df['Consumer_complaint_narrative'] == complaint_1]
complaint_2 = "I am disputing the inaccurate information the Chex-Systems has on my credit report. I initially submitted a police report on XXXX/XXXX/16 and Chex Systems only deleted the items that I mentioned in the letter and not all the items that were actually listed on the police report. In other words they wanted me to say word for word to them what items were fraudulent. The total disregard of the police report and what accounts that it states that are fraudulent. If they just had paid a little closer attention to the police report I would not been in this position now and they would n't have to research once again. I would like the reported information to be removed : XXXX XXXX XXXX"
print(clf.predict(count_vect.transform([complaint_2])))
['Credit reporting']
df[df['Consumer_complaint_narrative'] == complaint_2]


Not bad!
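One way to keep the vectorization at training time and at prediction time in step is to wrap both stages in a scikit-learn Pipeline. The sketch below is an alternative arrangement, not the code from the original notebook; the four training sentences and the name `text_clf` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data
train_texts = [
    "debt collector keeps calling about a debt that is not mine",
    "collection agency refuses to validate the debt",
    "my credit report lists an account that is not mine",
    "credit bureau will not fix errors on my credit report",
]
train_labels = ["Debt collection", "Debt collection",
                "Credit reporting", "Credit reporting"]

# The pipeline applies the same tf-idf transformation in fit and in predict
text_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
text_clf.fit(train_texts, train_labels)

pred = text_clf.predict(["please validate this debt"])
print(pred)
```

Because the vectorizer lives inside the pipeline, raw strings can be passed to `predict` directly and no separate transform call is needed.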

VI. Model Selection

We are now ready to experiment with more machine learning models, evaluate their accuracy, and find the source of any potential issues.

We will benchmark the following four models:

  • Logistic Regression
  • (Multinomial) Naive Bayes
  • Linear Support Vector Machine
  • Random Forest

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]
CV = 5
entries = []
for model in models:
    model_name = model.__class__.__name__
    # One accuracy score per cross-validation fold
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
# Long-format table: one row per (model, fold)
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
import seaborn as sns
sns.boxplot(x='model_name', y='accuracy', data=cv_df)
sns.stripplot(x='model_name', y='accuracy', data=cv_df,
              size=8, jitter=True, edgecolor="gray", linewidth=2)
plt.show()
cv_df.groupby('model_name').accuracy.mean()
model_name
LinearSVC: 0.822890
LogisticRegression: 0.792927
MultinomialNB: 0.688519
RandomForestClassifier: 0.443826
Name: accuracy, dtype: float64

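LinearSVC and Logistic Regression perform better than the other two classifiers, with LinearSVC having a slight advantage at a mean accuracy of around 82%.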
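A note on how the folds are formed: when `cv` is an integer and the estimator is a classifier, `cross_val_score` splits the data with `StratifiedKFold`, so each fold keeps roughly the class proportions of the full dataset. That matters here because our classes are imbalanced. A minimal sketch on made-up labels (the arrays below are ours):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 8 samples of class 0, 4 samples of class 1
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((len(y), 1))  # the features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=4)

# Count the class members that land in each test fold
fold_counts = [
    np.bincount(y[test_idx], minlength=2).tolist()
    for _, test_idx in skf.split(X, y)
]
print(fold_counts)  # every fold keeps the 2:1 class ratio
```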

VII. Model Evaluation

Continuing with our best model (LinearSVC), we look at its confusion matrix to see the discrepancies between predicted and actual labels.

model = LinearSVC()
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, df.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=category_id_df.Product.values, yticklabels=category_id_df.Product.values)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
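To read the heatmap, it helps to remember that scikit-learn puts the actual labels on the rows and the predicted labels on the columns. A minimal sketch with made-up labels (the label names and lists are ours):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for six complaints
actual = ["card", "card", "card", "loan", "loan", "report"]
predicted = ["card", "card", "report", "loan", "card", "report"]
labels_order = ["card", "loan", "report"]

# cm[i, j] counts samples whose actual label is i and predicted label is j
cm = confusion_matrix(actual, predicted, labels=labels_order)
print(cm)

# Correct predictions sit on the diagonal
accuracy = cm.trace() / cm.sum()
print(accuracy)
```

In this toy matrix the entry in row "card", column "report" is 1: one card complaint was predicted as a report complaint.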

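The vast majority of the predictions end up on the diagonal (predicted label = actual label), which is where we want them to be. However, there are still quite a few misclassifications, and it is interesting to see what causes them: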

from IPython.display import display
for predicted in category_id_df.category_id:
    for actual in category_id_df.category_id:
        # Only show confusions that occur at least 10 times
        if predicted != actual and conf_mat[actual, predicted] >= 10:
            print("'{}' predicted as '{}' : {} examples.".format(id_to_category[actual], id_to_category[predicted], conf_mat[actual, predicted]))
            display(df.loc[indices_test[(y_test == actual) & (y_pred == predicted)]][['Product', 'Consumer_complaint_narrative']])
            print('')

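As you can see, some of the misclassified complaints touch on more than one topic (for example, complaints that involve both a credit card and a credit report). This sort of error will always happen.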

Next, we use the coefficients of the fitted LinearSVC to find the terms most strongly associated with each category:

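model.fit(features, labels)
N = 2
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
all_feature_names = np.array(tfidf.get_feature_names_out())
for Product, category_id in sorted(category_to_id.items()):
    # Sort features by their weight in this category's one-vs-rest classifier (ascending)
    indices = np.argsort(model.coef_[category_id])
    feature_names = all_feature_names[indices]
    # Reverse so the largest weights come first, then keep the top N
    unigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 1][:N]
    bigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 2][:N]
    print("# '{}':".format(Product))
    print(" . Top unigrams:\n . {}".format('\n . '.join(unigrams)))
    print(" . Top bigrams:\n . {}".format('\n . '.join(bigrams)))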
# 'Bank account or service':
. Top unigrams:
. bank
. account
. Top bigrams:
. debit card
. overdraft fees
# 'Consumer Loan':
. Top unigrams:
. vehicle
. car
. Top bigrams:
. personal loan
. history xxxx
# 'Credit card':
. Top unigrams:
. card
. discover
. Top bigrams:
. credit card
. discover card
# 'Credit reporting':
. Top unigrams:
. equifax
. transunion
. Top bigrams:
. xxxx account
. trans union
# 'Debt collection':
. Top unigrams:
. debt
. collection
. Top bigrams:
. account credit
. time provided

The results are consistent with our expectations.

Finally, we print out the classification report for each class:

from sklearn import metrics
print(metrics.classification_report(y_test, y_pred, target_names=df['Product'].unique()))
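The per-class precision and recall in such a report can be derived directly from the confusion matrix: precision divides the diagonal by the column sums (everything predicted as that class), recall divides it by the row sums (everything that actually belongs to that class). A minimal sketch on made-up labels (the label names and lists are ours), cross-checked against scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical actual and predicted labels for six complaints
actual = ["card", "card", "card", "loan", "loan", "report"]
predicted = ["card", "card", "report", "loan", "card", "report"]
labels_order = ["card", "loan", "report"]

cm = confusion_matrix(actual, predicted, labels=labels_order)
true_positives = np.diag(cm)

# Precision: correct predictions / all predictions of that class (column sums)
manual_precision = true_positives / cm.sum(axis=0)
# Recall: correct predictions / all actual members of that class (row sums)
manual_recall = true_positives / cm.sum(axis=1)

# The same quantities as computed by scikit-learn
precision, recall, f1, support = precision_recall_fscore_support(
    actual, predicted, labels=labels_order, zero_division=0
)
print(manual_precision, precision)
print(manual_recall, recall)
```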

The source code above (https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Consumer_complaints.ipynb) can be found on GitHub.
