LightGBM stands for "light" gradient boosting machine (GBM); compared with XGBoost it trains faster and uses less memory. I plan to write a separate article later reviewing the optimizations LightGBM makes over XGBoost. This post focuses on how to build a scorecard with LightGBM and does not go into formula derivations. Since LightGBM's basic usage overlaps heavily with XGBoost's, here is a link to an introductory tutorial.
A brief introduction to using LightGBM: https://mathpretty.com/10649.html
This post is based on my notes from 梅子行's hands-on financial risk control course.
<code>
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
import numpy as np
import random
import math
import time
import lightgbm as lgb

data = pd.read_csv('Acard.txt')

# Hold out the 2018-11-30 snapshot as the out-of-time validation set.
df_train = data[data.obs_mth != '2018-11-30'].reset_index().copy()
val = data[data.obs_mth == '2018-11-30'].reset_index().copy()

# The 8 modeling variables.
lst = ['person_info', 'finance_info', 'credit_info', 'act_info',
       'td_score', 'jxl_score', 'mj_score', 'rh_score']
</code>
All the variables are numeric, so no further processing is needed. Because LightGBM grows CART regression trees, the workflow used here feeds it only numeric features; categorical features are first converted to numeric variables via one-hot encoding or label encoding. (Strictly speaking, LightGBM can also consume categorical features natively through its categorical_feature parameter, but encoding them is the common practice followed here.) Next, an out-of-time validation set is carved out: the November 2018 observations are held aside to evaluate the model's performance. There are 8 modeling variables in total: the ones ending in info are behavioral outputs of an in-house unsupervised system, and the ones ending in score are paid external credit-bureau scores.
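For illustration, here is a minimal sketch of the two encodings on a hypothetical categorical column city (not a field of this dataset):
<code>
# Hypothetical example: encoding a categorical column for a tree model.
# 'city' is an illustrative column, not part of the Acard dataset.
demo = pd.DataFrame({'city': ['bj', 'sh', 'gz', 'sh']})

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(demo['city'], prefix='city')

# Label encoding: map each category to an integer code.
demo['city_code'] = demo['city'].astype('category').cat.codes
</code>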
<code>
df_train = df_train.sort_values(by='obs_mth', ascending=False)

rank_lst = []
for i in range(1, len(df_train) + 1):
    rank_lst.append(i)
df_train['rank'] = rank_lst
df_train['rank'] = df_train['rank'] / len(df_train)

pct_lst = []
for x in df_train['rank']:
    if x <= 0.2:
        x = 1
    elif x <= 0.4:
        x = 2
    elif x <= 0.6:
        x = 3
    elif x <= 0.8:
        x = 4
    else:
        x = 5
    pct_lst.append(x)
df_train['rank'] = pct_lst
#train = train.drop('obs_mth', axis=1)
df_train.head()
</code>
Here the samples are split evenly into 5 folds and labeled accordingly, for cross-validation during training. One thing to note: in machine learning, a dataset is usually split into training, validation, and test sets, but in credit risk the definitions of validation and test sets are exactly reversed. The out-of-time "validation" set is really the test set, and the cross-validation "test" folds above are really the validation set. In short, the validation set is for tuning the model, while the test set measures how well the model generalizes.
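As an aside, the same five equal-size fold labels can be produced more concisely with pd.qcut; a sketch that replaces the if/elif chain above (equivalent up to boundary handling):
<code>
# Sketch: bin the fractional 'rank' column into 5 equal-frequency folds.
df_train['rank'] = pd.qcut(df_train['rank'], q=5,
                           labels=[1, 2, 3, 4, 5]).astype(int)
</code>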
<code>
def LGB_test(train_x, train_y, test_x, test_y):
    from multiprocessing import cpu_count
    clf = lgb.LGBMClassifier(
        boosting_type='gbdt',
        num_leaves=31,
        reg_alpha=0.0,
        reg_lambda=1,
        max_depth=2,
        n_estimators=800,
        max_features=140,   # not a LightGBM parameter; kept from the original code
        objective='binary',
        subsample=0.7,
        colsample_bytree=0.7,
        subsample_freq=1,
        learning_rate=0.05,
        min_child_weight=50,
        random_state=None,
        n_jobs=cpu_count() - 1,
        num_iterations=800  # alias of n_estimators
    )
    clf.fit(train_x, train_y,
            eval_set=[(train_x, train_y), (test_x, test_y)],
            eval_metric='auc',
            early_stopping_rounds=100)
    print(clf.n_features_)
    return clf, clf.best_score_['valid_1']['auc']
</code>
This defines the LightGBM training function, using the sklearn-style API. As with XGBoost, LightGBM can be driven in two ways (the native API or the sklearn wrapper), and many parameters carry over between the two libraries. A few of the parameters explained:
'num_leaves': the number of leaves per tree, default 31.
'reg_alpha': L1 regularization, default 0; in LightGBM's native API the corresponding alias is lambda_l1.
'objective': the loss function. The native default is regression, i.e. mean squared error for regression problems; binary, used here, means log loss as the objective for binary classification.
'min_child_weight': the minimum sum of instance weights (hessians) required in a child node; larger values make splits more conservative, which helps control overfitting.
'subsample_freq': the bagging frequency, default 0 (bagging disabled); a value k > 0 performs bagging every k iterations.
n_features_ is the number of features used in the model, 8 here. best_score_ holds the model's best scores as a dict; the function returns the AUC on the validation set.
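For reference, here is a minimal sketch of roughly the same configuration through LightGBM's native API, the other of the two methods mentioned above; the comments map the native aliases back to the sklearn names (the data-dependent lines are left commented since they rely on the fold variables defined below):
<code>
# Sketch of the native-API equivalent (not the code used in this post).
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 31,
    'max_depth': 2,
    'learning_rate': 0.05,
    'bagging_fraction': 0.7,   # subsample
    'feature_fraction': 0.7,   # colsample_bytree
    'bagging_freq': 1,         # subsample_freq
    'lambda_l1': 0.0,          # reg_alpha
    'lambda_l2': 1.0,          # reg_lambda
}
# dtrain = lgb.Dataset(train_x, label=train_y)
# dvalid = lgb.Dataset(test_x, label=test_y, reference=dtrain)
# bst = lgb.train(params, dtrain, num_boost_round=800,
#                 valid_sets=[dtrain, dvalid],
#                 early_stopping_rounds=100)
</code>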
<code>
feature_lst = {}
ks_train_lst = []
ks_test_lst = []
for rk in set(df_train['rank']):
    # Fold rk is the validation fold; the other four folds train the model.
    ttest = df_train[df_train['rank'] == rk]
    ttrain = df_train[df_train['rank'] != rk]
    train = ttrain[lst]
    train_y = ttrain.bad_ind
    test = ttest[lst]
    test_y = ttest.bad_ind

    start = time.time()
    model, auc = LGB_test(train, train_y, test, test_y)
    end = time.time()

    # Collect the feature importances of this fold's model.
    feature = pd.DataFrame({
        'name': model.booster_.feature_name(),
        'importance': model.feature_importances_
    }).sort_values(by=['importance'], ascending=False)

    y_pred_train_lgb = model.predict_proba(train)[:, 1]
    y_pred_test_lgb = model.predict_proba(test)[:, 1]
    train_fpr_lgb, train_tpr_lgb, _ = roc_curve(train_y, y_pred_train_lgb)
    test_fpr_lgb, test_tpr_lgb, _ = roc_curve(test_y, y_pred_test_lgb)
    train_ks = abs(train_fpr_lgb - train_tpr_lgb).max()
    test_ks = abs(test_fpr_lgb - test_tpr_lgb).max()
    train_auc = metrics.auc(train_fpr_lgb, train_tpr_lgb)
    test_auc = metrics.auc(test_fpr_lgb, test_tpr_lgb)
    ks_train_lst.append(train_ks)
    ks_test_lst.append(test_ks)
    # Keep only the features whose importance is at least 20 in this fold.
    feature_lst[str(rk)] = feature[feature.importance >= 20].name

train_ks = np.mean(ks_train_lst)
test_ks = np.mean(ks_test_lst)

# Intersect the important features across the five folds.
ft_lst = {}
for i in range(1, 6):
    ft_lst[str(i)] = feature_lst[str(i)]
fn_lst = list(set(ft_lst['1']) & set(ft_lst['2']) & set(ft_lst['3'])
              & set(ft_lst['4']) & set(ft_lst['5']))

print('train_ks: ', train_ks)
print('test_ks: ', test_ks)
print('ft_lst: ', fn_lst)
</code>
The reported KS here is the average of the KS values across the cross-validation folds. In each fold the features with importance of at least 20 are kept, and taking the intersection of those sets across the five folds leaves the 4 most important features. booster_.feature_name() holds the feature names and feature_importances_ their importance scores.
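Since KS appears throughout, a minimal helper (hypothetical name ks_stat, equivalent to the inline computations above) makes the definition explicit: KS is the maximum gap between the cumulative bad capture rate (TPR) and the cumulative good capture rate (FPR) over all score cutoffs:
<code>
# Minimal KS helper (hypothetical; mirrors the inline roc_curve computations).
def ks_stat(y_true, y_prob):
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    return abs(tpr - fpr).max()
</code>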
These 4 variables are then used to refit the model, to see how it performs on the out-of-time validation (i.e. test) set.
<code>
lst = ['person_info', 'finance_info', 'credit_info', 'act_info']

train = data[data.obs_mth != '2018-11-30'].reset_index().copy()
evl = data[data.obs_mth == '2018-11-30'].reset_index().copy()

x = train[lst]
y = train['bad_ind']
evl_x = evl[lst]
evl_y = evl['bad_ind']

model, auc = LGB_test(x, y, evl_x, evl_y)

y_pred = model.predict_proba(x)[:, 1]
fpr_lgb_train, tpr_lgb_train, _ = roc_curve(y, y_pred)
train_ks = abs(fpr_lgb_train - tpr_lgb_train).max()
print('train_ks : ', train_ks)

y_pred = model.predict_proba(evl_x)[:, 1]
fpr_lgb, tpr_lgb, _ = roc_curve(evl_y, y_pred)
evl_ks = abs(fpr_lgb - tpr_lgb).max()
print('evl_ks : ', evl_ks)

from matplotlib import pyplot as plt
plt.plot(fpr_lgb_train, tpr_lgb_train, label='train LR')
plt.plot(fpr_lgb, tpr_lgb, label='evl LR')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()
</code>
Finally, the predicted probabilities are mapped to scores and a score report is generated.
<code>
def score(xbeta):
    # Note: under this mapping the score increases with the predicted
    # default probability, so higher scores mean higher risk.
    score = 1000 - 500 * (math.log2(1 - xbeta) / xbeta)
    return score

evl['xbeta'] = model.predict_proba(evl_x)[:, 1]
evl['score'] = evl.apply(lambda x: score(x.xbeta), axis=1)
</code>
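Note that under this mapping the score rises with the predicted default probability, which is why the report below sorts scores in descending order. A more conventional scorecard mapping is an affine function of the log-odds of being good, so that good customers score high; a sketch with illustrative choices of base = 650 and PDO = 50 (both hypothetical, not from the original post):
<code>
# Sketch of a conventional base/PDO score mapping (illustrative constants).
def score_pdo(p, base=650, pdo=50):
    odds_good = (1 - p) / p          # odds of not defaulting
    return base + pdo * math.log2(odds_good)
</code>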
<code>
# Generate the score report.
row_num, col_num = 0, 0
bins = 20
Y_predict = evl['score']
Y = evl_y
nrows = Y.shape[0]
lis = [(Y_predict[i], Y[i]) for i in range(nrows)]
# Sort by score in descending order (highest risk first under this mapping).
ks_lis = sorted(lis, key=lambda x: x[0], reverse=True)
bin_num = int(nrows / bins + 1)
bad = sum([1 for (p, y) in ks_lis if y > 0.5])
good = sum([1 for (p, y) in ks_lis if y <= 0.5])
bad_cnt, good_cnt = 0, 0
KS = []
BAD = []
GOOD = []
BAD_CNT = []
GOOD_CNT = []
BAD_PCTG = []
BADRATE = []
dct_report = {}
for j in range(bins):
    ds = ks_lis[j * bin_num: min((j + 1) * bin_num, nrows)]
    bad1 = sum([1 for (p, y) in ds if y > 0.5])
    good1 = sum([1 for (p, y) in ds if y <= 0.5])
    bad_cnt += bad1
    good_cnt += good1
    bad_pctg = round(bad_cnt / sum(evl_y), 3)   # cumulative share of all bads
    badrate = round(bad1 / (bad1 + good1), 3)   # bad rate within this bin
    ks = round(math.fabs((bad_cnt / bad) - (good_cnt / good)), 3)
    KS.append(ks)
    BAD.append(bad1)
    GOOD.append(good1)
    BAD_CNT.append(bad_cnt)
    GOOD_CNT.append(good_cnt)
    BAD_PCTG.append(bad_pctg)
    BADRATE.append(badrate)
dct_report['KS'] = KS
dct_report['BAD'] = BAD
dct_report['GOOD'] = GOOD
dct_report['BAD_CNT'] = BAD_CNT
dct_report['GOOD_CNT'] = GOOD_CNT
dct_report['BAD_PCTG'] = BAD_PCTG
dct_report['BADRATE'] = BADRATE
val_repot = pd.DataFrame(dct_report)
</code>
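To see where the separation peaks, the KS column of the report can be plotted directly; a quick sketch:
<code>
# Quick sketch: plot KS across the 20 score bins (the peak marks the
# cutoff region with the strongest good/bad separation).
val_repot['KS'].plot(marker='o')
plt.xlabel('bin (sorted by score, descending)')
plt.ylabel('KS')
plt.title('KS across score bins')
plt.show()
</code>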
【Author】: Labryant
【Original WeChat public account】: 風控獵人
【About】: strategy analyst at a startup, always pushing to improve. The race is not yet decided; you and I can both be dark horses.
【Reposting】: please credit the source when reposting. Thanks!