
This post introduces the Decision Tree.

1. Decision Tree

To classify and predict data, the model learns a series of branches that can each be answered 'Yes/No' according to a specific criterion.

  • No assumptions such as linearity or normality are required, so preprocessing has little effect on model performance
  • Both continuous and categorical variables can be used
  • The classification rules are explicit, so the model is easy to interpret (a short sketch follows below)
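
As a quick illustration of the last point, here is a minimal sketch (assuming scikit-learn and its bundled iris dataset, used only for illustration) that fits a shallow tree and prints the learned Yes/No rules as plain text:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# fit a shallow tree on the iris data
iris = load_iris()
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# every path from the root reads as a chain of Yes/No questions on the features
print(export_text(tree_clf, feature_names=list(iris.feature_names)))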

2. Decision Tree Classifier

2.1 Parameters

class sklearn.tree.DecisionTreeClassifier(*, criterion='gini', splitter='best', 
max_depth=None, min_samples_split=2, min_samples_leaf=1, 
min_weight_fraction_leaf=0.0, max_features=None, random_state=None, 
max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, 
ccp_alpha=0.0)
  • criterion: function used to measure the quality of a split at a node
  • max_depth: maximum depth of the tree (the sketch after this list shows how the size-related parameters restrict growth)
  • min_samples_split: minimum number of samples required to split an internal node
  • min_samples_leaf: minimum number of samples that must remain in each leaf node
  • max_features: number of features considered when looking for the best split; if None, all features are used
  • max_leaf_nodes: maximum number of leaf nodes the tree may have
  • min_impurity_decrease: a node is split only if the split decreases the impurity by at least this value
  • ccp_alpha: complexity parameter for cost-complexity pruning; the subtree with the largest cost complexity that is still smaller than ccp_alpha is chosen, and no pruning is performed when it is 0 (the default)
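
The size-related parameters are easiest to understand by comparing tree sizes directly. A minimal sketch (again on scikit-learn's bundled iris dataset, purely for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris_X, iris_y = load_iris(return_X_y=True)

# an unrestricted tree vs. one regularized with max_depth and min_samples_leaf
full = DecisionTreeClassifier(random_state=0).fit(iris_X, iris_y)
small = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(iris_X, iris_y)

print(full.tree_.node_count, full.get_depth())    # larger, deeper tree
print(small.tree_.node_count, small.get_depth())  # fewer nodes, depth capped at 3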

2.2 Attributes

  • feature_importances_: returns the feature importances

2.3 Methods

  • cost_complexity_pruning_path
    • X: sample data, passed as a 2-D array
    • y: target data, with shape (n_samples,) or (n_samples, n_outputs)
    • Returns ccp_path as a dictionary-like object: ccp_alphas holds the effective alphas of the subtrees encountered during pruning, and impurities holds the sum of the leaf impurities of the subtree corresponding to each alpha (see the sketch below)
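
A minimal sketch of how cost_complexity_pruning_path is typically combined with ccp_alpha (on the iris data, purely for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris_X, iris_y = load_iris(return_X_y=True)
clf_cc = DecisionTreeClassifier(random_state=0)

# effective alphas and the corresponding sums of leaf impurities along the pruning path
path = clf_cc.cost_complexity_pruning_path(iris_X, iris_y)
print(path.ccp_alphas)
print(path.impurities)

# refit with one of the candidate alphas to obtain a pruned tree
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(iris_X, iris_y)
print(pruned.get_depth(), pruned.tree_.node_count)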

2.4 Implementation

Loading the data

import pandas as pd
credit = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/credit_final.csv")
credit

# create feature & target
feature_columns = list(credit.columns.difference(['credit.rating']))
X = credit[feature_columns]
y = credit['credit.rating']

# train-test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    test_size=0.3, random_state=1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(700, 20)
(300, 20)
(700,)
(300,)
# modeling
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=5)
clf.fit(x_train, y_train)
# evaluation
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
pred=clf.predict(x_test)
test_cm=confusion_matrix(y_test, pred)
test_acc=accuracy_score(y_test, pred)
test_prc=precision_score(y_test, pred)
test_rcll=recall_score(y_test, pred)
test_f1=f1_score(y_test, pred)
print(test_cm)
print('\n')
print('Accuracy\t{}%'.format(round(test_acc * 100, 2)))
print('Precision\t{}%'.format(round(test_prc * 100, 2)))
print('Recall\t{}%'.format(round(test_rcll * 100, 2)))
[[ 28  62]
 [ 27 183]]

Accuracy	70.33%
Precision	74.69%
Recall	87.14%
# classification report
from sklearn.metrics import classification_report
report = classification_report(y_test, pred)
print(report)
              precision    recall  f1-score   support

           0       0.51      0.31      0.39        90
           1       0.75      0.87      0.80       210

    accuracy                           0.70       300
   macro avg       0.63      0.59      0.60       300
weighted avg       0.68      0.70      0.68       300
# roc
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
roc_auc = auc(fpr, tpr)  # compute the AUC here so it can appear in the legend below

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()

# auc
from sklearn.metrics import roc_auc_score
roc_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
print("ROC_AUC_score : ", roc_auc)
ROC_AUC_score :  0.717857142857143
# feature importance
importances = clf.feature_importances_
column_nm = pd.DataFrame(X.columns)
feature_importances = pd.concat([column_nm, pd.DataFrame(importances)], axis=1)
feature_importances.columns = ['feature_nm', 'importances']
print(feature_importances)
                        feature_nm  importances
0                  account.balance     0.263282
1                              age     0.112494
2                   apartment.type     0.021665
3                     bank.credits     0.000000
4                    credit.amount     0.095584
5           credit.duration.months     0.187908
6                   credit.purpose     0.059083
7                   current.assets     0.000000
8                       dependents     0.000000
9              employment.duration     0.000000
10                  foreign.worker     0.000000
11                       guarantor     0.011790
12                installment.rate     0.000000
13                  marital.status     0.016325
14                      occupation     0.000000
15                   other.credits     0.034003
16  previous.credit.payment.status     0.123825
17              residence.duration     0.020960
18                         savings     0.053080
19                       telephone     0.000000
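
The table above is easier to scan as a chart; here is a minimal sketch that draws the importances computed above as a horizontal bar plot (matplotlib only):

import matplotlib.pyplot as plt

# sort by importance and draw a horizontal bar chart
sorted_fi = feature_importances.sort_values('importances', ascending=True)
plt.figure(figsize=(8, 6))
plt.barh(sorted_fi['feature_nm'], sorted_fi['importances'])
plt.xlabel('importance')
plt.title('Decision Tree feature importances')
plt.tight_layout()
plt.show()
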
# decision tree visualization
import os
# add the local Graphviz install to PATH (adjust the path for your environment)
os.environ["PATH"] += os.pathsep + '/path/to/graphviz/bin'

import numpy as np
feature_names = feature_columns
target_names = np.array(['0', '1'])
import pydotplus
from sklearn.tree import export_graphviz
dt_dot_data = export_graphviz(clf, feature_names=feature_names,
                              class_names=target_names,
                              filled=True, rounded=True,
                              special_characters=True)
dt_graph = pydotplus.graph_from_dot_data(dt_dot_data)
from IPython.display import Image
Image(dt_graph.create_png())
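
If Graphviz is not installed on the system, sklearn.tree.plot_tree renders the same fitted tree with matplotlib alone; a minimal sketch (limiting the drawing depth so the plot stays readable):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# matplotlib-only rendering of the fitted classifier (no Graphviz required)
plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=feature_names, class_names=target_names,
          filled=True, rounded=True, max_depth=3, fontsize=8)
plt.show()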

3. Decision Tree Regressor

3.1 Parameters

class sklearn.tree.DecisionTreeRegressor(*, criterion='squared_error', splitter='best', 
max_depth=None, min_samples_split=2, min_samples_leaf=1, 
min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, 
min_impurity_decrease=0.0, ccp_alpha=0.0) 
  • Same as the classifier, except that criterion defaults to 'squared_error' (regression criteria such as 'squared_error', 'absolute_error', 'friedman_mse', and 'poisson' are available) and there is no class_weight parameter

3.2 Attributes

  • Same as the classifier

3.3 Methods

  • Same as the classifier

3.4 Implementation

Generating the data

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

np.random.seed(0)
X = np.sort(5 * np.random.rand(400, 1), axis=0)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y = np.sin(X).ravel()
# add noise to every sample
y[::1] += 1 * (0.5 - np.random.rand(400))
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")

# train-test split
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, y, train_size=0.7,
                                                    random_state=1)
print(train_x.shape, test_x.shape, train_y.shape, test_y.shape)
(280, 1) (120, 1) (280,) (120,)
# modeling
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import pandas as pd
import numpy as np
y_1 = regr_1.fit(train_x, train_y).predict(test_x)
y_2 = regr_2.fit(train_x, train_y).predict(test_x)
preds = [y_1, y_2]
weights = ["max depth = 2", "max depth = 5"]
evls = ['mse', 'rmse', 'mae']
results = pd.DataFrame(index=weights, columns=evls)
for pred, nm in zip(preds, weights):
    mse = mean_squared_error(test_y, pred)
    mae = mean_absolute_error(test_y, pred)
    rmse = np.sqrt(mse)

    # assign with .loc[row, col] to avoid chained-indexing assignment
    results.loc[nm, 'mse'] = round(mse, 2)
    results.loc[nm, 'rmse'] = round(rmse, 2)
    results.loc[nm, 'mae'] = round(mae, 2)
results

X_test = np.sort(5 * np.random.rand(40, 1), axis=0)
regrs = [regr_1, regr_2]
depths = ["max depth = 2", "max depth = 5"]
model_color = ["m", "c"]
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(13, 5))
for ix, regr in enumerate(regrs):
    pred = regr.fit(X, y).predict(X_test)
    # score the predictions against the noiseless ground truth sin(x) at the new points
    true_y = np.sin(X_test).ravel()
    r2 = r2_score(true_y, pred)
    mae = mean_absolute_error(true_y, pred)
    axes[ix].plot(X_test, pred,
                  color=model_color[ix],
                  label="{}".format(depths[ix]))
    axes[ix].scatter(X, y,
                     s=20,
                     edgecolor="gray",
                     c="darkorange",
                     label="data")
    axes[ix].legend(loc="upper right",
                    ncol=1,
                    fancybox=True,
                    shadow=True)
    axes[ix].set_title("R2 : {r} , MAE : {m}".format(r=round(r2, 3), m=round(mae, 3)))
fig.text(0.5, 0.04, "data", ha="center", va="center")
fig.text(0.06, 0.5, "target", ha="center", va="center", rotation="vertical")
fig.suptitle("Decision Tree Regression", fontsize=14)
plt.show()
