Getting Started with the xgboost Library

The notebook for this article is on GitHub as 1-1 基本模型呼叫.ipynb; it records my work from Kaggle competitions. Feel free to star and follow.

# enable multi-line output in the notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# InteractiveShell.ast_node_interactivity = "last_expr"

# display figures inline
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Data Exploration

XGBoost can consume data in libsvm format, which is optimized for sparse features. Here is an example:

1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
0 2:1.2 1212:21 7777:2

Each line is one sample. The leading 0 or 1 is the label; what follows are feature_index:value pairs, and any feature not listed is 0.
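
To make this concrete, here is a minimal sketch (the helper name is made up for illustration) that expands one libsvm line into a label and a sparse feature mapping:

# illustration only: parse one libsvm line into a label and a
# {feature_index: value} dict; any index not present is implicitly 0
def parse_libsvm_line(line):
    parts = line.split()
    label = float(parts[0])
    features = {int(idx): float(val)
                for idx, val in (p.split(":") for p in parts[1:])}
    return label, features

label, features = parse_libsvm_line("1 101:1.2 102:0.03")
print(label)     # 1.0
print(features)  # {101: 1.2, 102: 0.03}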

We will use the task of judging whether a mushroom is poisonous for the rest of the training. The dataset comes from http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/ . Each mushroom has 22 attributes; after processing these raw attributes we obtain 126-dimensional features, stored in libsvm format, and the label indicates whether the mushroom is poisonous. 6513 samples are used for training and 1611 for testing.
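
The jump from 22 raw attributes to 126 features comes from one-hot encoding the categorical attributes. The shipped demo files are already encoded, but a rough sketch of the idea, with made-up attribute values, looks like this:

# illustration only: one-hot encoding turns each categorical attribute
# into one indicator feature per possible value (made-up attributes here)
cap_shapes = ["bell", "conical", "flat"]
odors = ["almond", "foul", "none"]

def encode(cap_shape, odor):
    # indicator vector: one slot per (attribute, value) pair
    return ([1 if cap_shape == v else 0 for v in cap_shapes] +
            [1 if odor == v else 0 for v in odors])

print(encode("flat", "foul"))  # [0, 0, 1, 0, 1, 0]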

import xgboost as xgb
from sklearn.metrics import accuracy_score

DMatrix is an internal data structure used by XGBoost, optimized for both memory efficiency and training speed.

A DMatrix can be constructed from a string, a numpy array, a scipy.sparse matrix, or a pd.DataFrame. If it is a string, it is taken as the path to a libsvm file or to a binary file that xgboost can read.
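
Besides file paths, a DMatrix can also be built directly from in-memory data; a minimal sketch with toy data (the arrays here are made up for illustration):

import numpy as np
import scipy.sparse
import xgboost as xgb

# toy data: 3 samples x 2 features
X_dense = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
X_sparse = scipy.sparse.csr_matrix(X_dense)
y = np.array([0, 1, 0])

d1 = xgb.DMatrix(X_dense, label=y)   # from a numpy array
d2 = xgb.DMatrix(X_sparse, label=y)  # from a scipy.sparse matrix
print(d1.num_row(), d1.num_col())    # 3 2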

data_fold = "./data/"
dtrain = xgb.DMatrix(data_fold + "agaricus.txt.train")
dtest = xgb.DMatrix(data_fold + "agaricus.txt.test")

Inspect the data:

(dtrain.num_col(), dtrain.num_row())
(dtest.num_col(), dtest.num_row())

(127, 6513)
(127, 1611)

Model Training

Basic parameter settings:

max_depth: maximum depth of a tree. Default 6; range [1, ∞].

eta: the shrinkage step size used in each update to prevent overfitting. By shrinking the weight of each new tree, eta makes the boosting process more conservative (see the sketch after this list). Default 0.3; range [0, 1].

silent: 0 prints runtime messages; 1 runs silently without printing them. Default 0.

objective: defines the learning task and the corresponding objective. "binary:logistic" means logistic regression for binary classification, with probabilities as output.
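
Concretely, eta scales the contribution of each new tree in the additive boosting update (this is the standard gradient boosting form, not something specific to this demo):

    ŷᵢ⁽ᵗ⁾ = ŷᵢ⁽ᵗ⁻¹⁾ + η · f_t(xᵢ)

A smaller eta means each round changes the model less, so more boosting rounds are typically needed.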

param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}

%time
# set the number of boosting rounds
num_round = 2
bst = xgb.train(param, dtrain, num_round)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 65.6 s

Here the model outputs probabilities; we convert them to 0/1 values and then compute the accuracy.

train_preds = bst.predict(dtrain)
train_predictions = [round(value) for value in train_preds]
y_train = dtrain.get_label()
train_accuracy = accuracy_score(y_train, train_predictions)
print("Train Accuracy: %.2f%%" % (train_accuracy * 100.0))

Train Accuracy: 97.77%

Finally, let's look at the model's accuracy on the test set.

preds = bst.predict(dtest)
predictions = [round(value) for value in preds]
y_test = dtest.get_label()
test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy: %.2f%%" % (test_accuracy * 100.0))

Test Accuracy: 97.83%

from matplotlib import pyplot
import graphviz

xgb.to_graphviz(bst, num_trees=0)  # draw the first tree
pyplot.show()
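
to_graphviz returns a graphviz object that renders inline in a notebook. Outside a notebook, xgb.plot_tree draws the same tree onto a matplotlib axis instead (it still assumes graphviz is installed):

# alternative: render the first tree via matplotlib
xgb.plot_tree(bst, num_trees=0)
pyplot.show()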

The scikit-learn Interface

from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file

my_workpath = './data/'
X_train, y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')

# set the number of boosting rounds
num_round = 2

# bst = XGBClassifier(**params)
# bst = XGBClassifier()
bst = XGBClassifier(max_depth=2, learning_rate=1, n_estimators=num_round,
                    silent=True, objective='binary:logistic')
bst.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=1, max_delta_step=0,
       max_depth=2, min_child_weight=1, missing=None, n_estimators=2,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

# accuracy on the training set
train_preds = bst.predict(X_train)
train_predictions = [round(value) for value in train_preds]
train_accuracy = accuracy_score(y_train, train_predictions)
print("Train Accuracy: %.2f%%" % (train_accuracy * 100.0))

Train Accuracy: 97.77%

# accuracy on the test set
# make prediction
preds = bst.predict(X_test)
predictions = [round(value) for value in preds]
test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy: %.2f%%" % (test_accuracy * 100.0))

Test Accuracy: 97.83%

Cross-Validation with scikit-learn

Cross-validation mainly uses the StratifiedKFold class shown below.

# set the number of boosting rounds
num_round = 2
bst = XGBClassifier(max_depth=2, learning_rate=0.1, n_estimators=num_round,
                    silent=True, objective='binary:logistic')

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

kfold = StratifiedKFold(n_splits=10, random_state=7)
results = cross_val_score(bst, X_train, y_train, cv=kfold)
print(results)
print("CV Accuracy: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

[ 0.69478528  0.85276074  0.95398773  0.97235023  0.96006144  0.98771121
  1.          1.          0.96927803  0.97695853]
CV Accuracy: 93.68% (9.00%)

Searching for the Best Parameters with GridSearchCV

from sklearn.model_selection import GridSearchCV

bst = XGBClassifier(max_depth=2, learning_rate=0.1, silent=True,
                    objective='binary:logistic')

%time
param_grid = {'n_estimators': range(1, 51, 1)}
clf = GridSearchCV(bst, param_grid, scoring="accuracy", cv=5)
clf.fit(X_train, y_train)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 24.3 s

clf.best_params_, clf.best_score_

({'n_estimators': 30}, 0.98418547520343924)
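
Beyond the single best candidate, clf.cv_results_ stores the mean cross-validated score of every candidate, which makes it easy to see how accuracy varies with n_estimators; a short sketch:

# plot mean CV accuracy against each tried n_estimators value
from matplotlib import pyplot

n_estimators = [p['n_estimators'] for p in clf.cv_results_['params']]
pyplot.plot(n_estimators, clf.cv_results_['mean_test_score'])
pyplot.xlabel('n_estimators')
pyplot.ylabel('Mean CV Accuracy')
pyplot.show()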

# evaluate on the test set
# make prediction
preds = clf.predict(X_test)
predictions = [round(value) for value in preds]
test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy of gridsearchcv: %.2f%%" % (test_accuracy * 100.0))

Test Accuracy of gridsearchcv: 97.27%

Early Stopping

We hold out a validation set during training; if the error on it fails to improve for a number of consecutive rounds, boosting stops early.

from sklearn.model_selection import train_test_split

seed = 7
test_size = 0.33
X_train_part, X_validate, y_train_part, y_validate = train_test_split(
    X_train, y_train, test_size=test_size, random_state=seed)
X_train_part.shape
X_validate.shape

(4363, 126)
(2150, 126)

# set the number of boosting rounds
num_round = 100
bst = XGBClassifier(max_depth=2, learning_rate=0.1, n_estimators=num_round,
                    silent=True, objective='binary:logistic')
eval_set = [(X_validate, y_validate)]
bst.fit(X_train_part, y_train_part, early_stopping_rounds=10,
        eval_metric="error", eval_set=eval_set, verbose=True)

[0]	validation_0-error:0.048372
Will train until validation_0-error hasn't improved in 10 rounds.
[1]	validation_0-error:0.042326
[2]	validation_0-error:0.048372
[3]	validation_0-error:0.042326
[4]	validation_0-error:0.042326
[5]	validation_0-error:0.042326
[6]	validation_0-error:0.023256
[7]	validation_0-error:0.042326
[8]	validation_0-error:0.042326
[9]	validation_0-error:0.023256
[10]	validation_0-error:0.006512
[11]	validation_0-error:0.017674
[12]	validation_0-error:0.017674
[13]	validation_0-error:0.017674
[14]	validation_0-error:0.017674
[15]	validation_0-error:0.017674
[16]	validation_0-error:0.017674
[17]	validation_0-error:0.017674
[18]	validation_0-error:0.024651
[19]	validation_0-error:0.020465
[20]	validation_0-error:0.020465
Stopping. Best iteration:
[10]	validation_0-error:0.006512

We can plot the error rates above for a more intuitive view.

results = bst.evals_result()
# print(results)
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)

# plot classification error
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Test')
ax.legend()
pyplot.ylabel('Error')
pyplot.xlabel('Round')
pyplot.title('XGBoost Early Stop')
pyplot.show()

# accuracy on the test set
# make prediction
preds = bst.predict(X_test)
predictions = [round(value) for value in preds]
test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy: %.2f%%" % (test_accuracy * 100.0))

Test Accuracy: 97.27%
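
Note that the prediction above uses all trees trained before stopping. In xgboost releases of this era, the fitted model also records the best round, and prediction can be limited to it; a sketch, assuming the ntree_limit argument of the sklearn wrapper (recent releases renamed it to iteration_range):

# predict using only the trees up to the best early-stopping round;
# best_ntree_limit / ntree_limit are the names used by older xgboost releases
print(bst.best_iteration, bst.best_ntree_limit)
preds_best = bst.predict(X_test, ntree_limit=bst.best_ntree_limit)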

Learning Curves

# set the number of boosting rounds
num_round = 100
# no early stopping this time
bst = XGBClassifier(max_depth=2, learning_rate=0.1, n_estimators=num_round,
                    silent=True, objective='binary:logistic')
eval_set = [(X_train_part, y_train_part), (X_validate, y_validate)]
bst.fit(X_train_part, y_train_part, eval_metric=["error", "logloss"],
        eval_set=eval_set, verbose=True)

# retrieve performance metrics
results = bst.evals_result()
# print(results)
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)

# plot log loss
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
pyplot.ylabel('Log Loss')
pyplot.title('XGBoost Log Loss')
pyplot.show()

# plot classification error
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Test')
ax.legend()
pyplot.ylabel('Classification Error')
pyplot.title('XGBoost Classification Error')
pyplot.show()

# make prediction
preds = bst.predict(X_test)
predictions = [round(value) for value in preds]
test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy: %.2f%%" % (test_accuracy * 100.0))

Test Accuracy: 99.81%