Getting Started with the xgboost Library

The notebook for this article is on GitHub as 1-1 基本模型呼叫.ipynb; it records my work from Kaggle competitions. Feel free to star and follow.

# enable multi-line output in the notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# InteractiveShell.ast_node_interactivity = "last_expr"

# display figures inline
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Data Exploration

XGBoost can consume data in libsvm format, which is optimized for sparse features. Here is an example:

1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
0 2:1.2 1212:21 7777:2

Each line is one sample. The leading 0 or 1 is the label; what follows are feature_index:value pairs, and any feature not listed is 0.
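
To make this concrete, here is a minimal sketch (the helper name is made up for illustration) that expands one libsvm line into a label and a sparse feature mapping:

# illustration only: parse one libsvm line into a label and a
# {feature_index: value} dict; any index not present is implicitly 0
def parse_libsvm_line(line):
    parts = line.split()
    label = float(parts[0])
    features = {int(idx): float(val)
                for idx, val in (p.split(":") for p in parts[1:])}
    return label, features

label, features = parse_libsvm_line("1 101:1.2 102:0.03")
print(label)     # 1.0
print(features)  # {101: 1.2, 102: 0.03}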

We will use the task of judging whether a mushroom is poisonous for the rest of the training. The dataset comes from http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/ . Each mushroom has 22 attributes; after processing these raw attributes we obtain 126-dimensional features, stored in libsvm format, and the label indicates whether the mushroom is poisonous. 6513 samples are used for training and 1611 for testing.
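
The jump from 22 raw attributes to 126 features comes from one-hot encoding the categorical attributes. The shipped demo files are already encoded, but a rough sketch of the idea, with made-up attribute values, looks like this:

# illustration only: one-hot encoding turns each categorical attribute
# into one indicator feature per possible value (made-up attributes here)
cap_shapes = ["bell", "conical", "flat"]
odors = ["almond", "foul", "none"]

def encode(cap_shape, odor):
    # indicator vector: one slot per (attribute, value) pair
    return ([1 if cap_shape == v else 0 for v in cap_shapes] +
            [1 if odor == v else 0 for v in odors])

print(encode("flat", "foul"))  # [0, 0, 1, 0, 1, 0]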

import xgboost as xgb
from sklearn.metrics import accuracy_score

DMatrix is an internal data structure used by XGBoost, optimized for both memory efficiency and training speed.

A DMatrix can be constructed from a string, a numpy array, a scipy.sparse matrix, or a pd.DataFrame. If it is a string, it is taken as the path to a libsvm file or to a binary file that xgboost can read.
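
Besides file paths, a DMatrix can also be built directly from in-memory data; a minimal sketch with toy data (the arrays here are made up for illustration):

import numpy as np
import scipy.sparse
import xgboost as xgb

# toy data: 3 samples x 2 features
X_dense = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
X_sparse = scipy.sparse.csr_matrix(X_dense)
y = np.array([0, 1, 0])

d1 = xgb.DMatrix(X_dense, label=y)   # from a numpy array
d2 = xgb.DMatrix(X_sparse, label=y)  # from a scipy.sparse matrix
print(d1.num_row(), d1.num_col())    # 3 2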

data_fold = "./data/"
dtrain = xgb.DMatrix(data_fold + "agaricus.txt.train")
dtest = xgb.DMatrix(data_fold + "agaricus.txt.test")

Inspect the data:

(dtrain.num_col(), dtrain.num_row())
(dtest.num_col(), dtest.num_row())

(127, 6513)
(127, 1611)

Model Training

Basic parameter settings:

max_depth: maximum depth of a tree. Default 6; range [1, ∞].

eta: the shrinkage step size used in each update to prevent overfitting. By shrinking the weight of each new tree, eta makes the boosting process more conservative (see the sketch after this list). Default 0.3; range [0, 1].

silent: 0 prints runtime messages; 1 runs silently without printing them. Default 0.

objective: defines the learning task and the corresponding objective. "binary:logistic" means logistic regression for binary classification, with probabilities as output.
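
Concretely, eta scales the contribution of each new tree in the additive boosting update (this is the standard gradient boosting form, not something specific to this demo):

    ŷᵢ⁽ᵗ⁾ = ŷᵢ⁽ᵗ⁻¹⁾ + η · f_t(xᵢ)

A smaller eta means each round changes the model less, so more boosting rounds are typically needed.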

param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}

%time
# set the number of boosting rounds
num_round = 2
bst = xgb.train(param, dtrain, num_round)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 65.6 s

Here the model outputs probabilities; we convert them to 0/1 values and then compute the accuracy.

train_preds = bst.predict(dtrain)
train_predictions = [round(value) for value in train_preds]
y_train = dtrain.get_label()
train_accuracy = accuracy_score(y_train, train_predictions)
print("Train Accuracy: %.2f%%" % (train_accuracy * 100.0))

Train Accuracy: 97.77%

Finally, let's look at the model's accuracy on the test set.

preds = bst.predict(dtest)
predictions = [round(value) for value in preds]
y_test = dtest.get_label()
test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy: %.2f%%" % (test_accuracy * 100.0))

Test Accuracy: 97.83%

from matplotlib import pyplot
import graphviz

xgb.to_graphviz(bst, num_trees=0)  # draw the first tree
pyplot.show()
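
to_graphviz returns a graphviz object that renders inline in a notebook. Outside a notebook, xgb.plot_tree draws the same tree onto a matplotlib axis instead (it still assumes graphviz is installed):

# alternative: render the first tree via matplotlib
xgb.plot_tree(bst, num_trees=0)
pyplot.show()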

The scikit-learn Interface

from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file

my_workpath = './data/'
X_train, y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')

# set the number of boosting rounds
num_round = 2

# bst = XGBClassifier(**params)
# bst = XGBClassifier()
bst = XGBClassifier(max_depth=2, learning_rate=1, n_estimators=num_round,
                    silent=True, objective='binary:logistic')
bst.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=1, max_delta_step=0,
       max_depth=2, min_child_weight=1, missing=None, n_estimators=2,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

# accuracy on the training set
train_preds = bst.predict(X_train)
train_predictions = [round(value) for value in train_preds]
train_accuracy = accuracy_score(y_train, train_predictions)
print("Train Accuracy: %.2f%%" % (train_accuracy * 100.0))

Train Accuracy: 97.77%

# accuracy on the test set
# make prediction
preds = bst.predict(X_test)
predictions = [round(value) for value in preds]
test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy: %.2f%%" % (test_accuracy * 100.0))

Test Accuracy: 97.83%

Cross-Validation with scikit-learn

Cross-validation mainly uses the StratifiedKFold class shown below.

# set the number of boosting rounds
num_round = 2
bst = XGBClassifier(max_depth=2, learning_rate=0.1, n_estimators=num_round,
                    silent=True, objective='binary:logistic')

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

kfold = StratifiedKFold(n_splits=10, random_state=7)
results = cross_val_score(bst, X_train, y_train, cv=kfold)
print(results)
print("CV Accuracy: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

[ 0.69478528  0.85276074  0.95398773  0.97235023  0.96006144  0.98771121
  1.          1.          0.96927803  0.97695853]
CV Accuracy: 93.68% (9.00%)

Searching for the Best Parameters with GridSearchCV

from sklearn.model_selection import GridSearchCV

bst = XGBClassifier(max_depth=2, learning_rate=0.1, silent=True,
                    objective='binary:logistic')

%time
param_grid = {'n_estimators': range(1, 51, 1)}
clf = GridSearchCV(bst, param_grid, scoring="accuracy", cv=5)
clf.fit(X_train, y_train)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 24.3 s

clf.best_params_, clf.best_score_

({'n_estimators': 30}, 0.98418547520343924)
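
Beyond the single best candidate, clf.cv_results_ stores the mean cross-validated score of every candidate, which makes it easy to see how accuracy varies with n_estimators; a short sketch:

# plot mean CV accuracy against each tried n_estimators value
from matplotlib import pyplot

n_estimators = [p['n_estimators'] for p in clf.cv_results_['params']]
pyplot.plot(n_estimators, clf.cv_results_['mean_test_score'])
pyplot.xlabel('n_estimators')
pyplot.ylabel('Mean CV Accuracy')
pyplot.show()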

# evaluate on the test set
# make prediction
preds = clf.predict(X_test)
predictions = [round(value) for value in preds]
test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy of gridsearchcv: %.2f%%" % (test_accuracy * 100.0))

Test Accuracy of gridsearchcv: 97.27%

Early Stopping

We hold out a validation set during training; if the error on it fails to improve for a number of consecutive rounds, boosting stops early.

from sklearn.model_selection import train_test_split

seed = 7
test_size = 0.33
X_train_part, X_validate, y_train_part, y_validate = train_test_split(
    X_train, y_train, test_size=test_size, random_state=seed)
X_train_part.shape
X_validate.shape

(4363, 126)
(2150, 126)

# set the number of boosting rounds
num_round = 100
bst = XGBClassifier(max_depth=2, learning_rate=0.1, n_estimators=num_round,
                    silent=True, objective='binary:logistic')
eval_set = [(X_validate, y_validate)]
bst.fit(X_train_part, y_train_part, early_stopping_rounds=10,
        eval_metric="error", eval_set=eval_set, verbose=True)

[0]	validation_0-error:0.048372
Will train until validation_0-error hasn't improved in 10 rounds.
[1]	validation_0-error:0.042326
[2]	validation_0-error:0.048372
[3]	validation_0-error:0.042326
[4]	validation_0-error:0.042326
[5]	validation_0-error:0.042326
[6]	validation_0-error:0.023256
[7]	validation_0-error:0.042326
[8]	validation_0-error:0.042326
[9]	validation_0-error:0.023256
[10]	validation_0-error:0.006512
[11]	validation_0-error:0.017674
[12]	validation_0-error:0.017674
[13]	validation_0-error:0.017674
[14]	validation_0-error:0.017674
[15]	validation_0-error:0.017674
[16]	validation_0-error:0.017674
[17]	validation_0-error:0.017674
[18]	validation_0-error:0.024651
[19]	validation_0-error:0.020465
[20]	validation_0-error:0.020465
Stopping. Best iteration:
[10]	validation_0-error:0.006512

We can plot the error rates above for a more intuitive view.

results = bst.evals_result()
# print(results)
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)

# plot classification error
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Test')
ax.legend()
pyplot.ylabel('Error')
pyplot.xlabel('Round')
pyplot.title('XGBoost Early Stop')
pyplot.show()

# accuracy on the test set
# make prediction
preds = bst.predict(X_test)
predictions = [round(value) for value in preds]
test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy: %.2f%%" % (test_accuracy * 100.0))

Test Accuracy: 97.27%
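
Note that the prediction above uses all trees trained before stopping. In xgboost releases of this era, the fitted model also records the best round, and prediction can be limited to it; a sketch, assuming the ntree_limit argument of the sklearn wrapper (recent releases renamed it to iteration_range):

# predict using only the trees up to the best early-stopping round;
# best_ntree_limit / ntree_limit are the names used by older xgboost releases
print(bst.best_iteration, bst.best_ntree_limit)
preds_best = bst.predict(X_test, ntree_limit=bst.best_ntree_limit)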

Learning Curves

# set the number of boosting rounds
num_round = 100
# no early stopping this time
bst = XGBClassifier(max_depth=2, learning_rate=0.1, n_estimators=num_round,
                    silent=True, objective='binary:logistic')
eval_set = [(X_train_part, y_train_part), (X_validate, y_validate)]
bst.fit(X_train_part, y_train_part, eval_metric=["error", "logloss"],
        eval_set=eval_set, verbose=True)

# retrieve performance metrics
results = bst.evals_result()
# print(results)
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)

# plot log loss
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
pyplot.ylabel('Log Loss')
pyplot.title('XGBoost Log Loss')
pyplot.show()

# plot classification error
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Test')
ax.legend()
pyplot.ylabel('Classification Error')
pyplot.title('XGBoost Classification Error')
pyplot.show()

# make prediction
preds = bst.predict(X_test)
predictions = [round(value) for value in preds]
test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy: %.2f%%" % (test_accuracy * 100.0))

Test Accuracy: 99.81%