ADP 실기 9장 Decision Tree

2023-08-18 4 분 소요

본 게시물은 Decision Tree에 대해 소개한다.

1. Decision Tree

데이터를 분류하고 예측하기 위해 특정 기준에 따라 ‘Yes/No’로 대답할 수 있는 분기를 두도록 학습

선형성, 정규성 등 가정이 필요하지 않아 전처리 과정에서 모델 성능 영향이 크지 않음
연속형, 범주형 모두 사용 가능
분류 규칙 명확하여 해석이 쉬움

2. Decision Tree Classifier

2.1 Parameters

class sklearn.tree.DecisionTreeClassifier(*, criterion=‘gini’, splitter=‘best’, 
max_depth=None, min_samples_split=2, min_samples_leaf=1, 
min_weight_fraction_leaf=0.0, max_features=None, random_state=None, 
max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, 
ccp_alpha=0.0)

criterion: 노드 분할시 사용 함수
max_depth: 트리의 최대 깊이
min_samples_split: 노드 분할시 필요한 최소 데이터 수
min_samples_leaf: 리프노드에 있어야할 최소 데이터 수
max_features: 노드 분할시 사용할 feature의 수, None or auto일 경우 원본 데이터 개수를 사용
max_leaf_node: 리프노드의 최대 데이터 수
max_impurity_decrease: 노드가 분할되는 조건으로 해당 값보다 작거나 같은 수준으로 복잡도가 감소할 시 분할
ccp_alpha: pruning에 사용되는 parameter로 ccp-alpha보다 작은 비용 - 복잡성을 가진 서브트리 중 가장 비용 - 복잡성이 큰 트리를 선택함, 0일경우 prune하지 않음

2.2 Attributes

feature_importances_: 변수 중요도를 반화

2.3 Method

cost_complexity_pruning_path
- X : 샘플데이터. 2차원 array 형태로 입력
- y : 타깃데이터. (n_samples,) 또는 (n_samples, n_targets) 형태로 입력
- 딕셔너리 형태의 ccp_path가 반환. ccp_alphas는 가지치기 동안의 서브트리에 대한 effective alpha값이며, impurities는 ccp_alpha값에 상응하는 서브트리 리프들의 불순도 합

2.4 Implementation

😗 데이터 불러오기

import pandas as pd
credit = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/credit_final.csv")
credit

# feature & target 생성
feature_columns = list(credit.columns.difference(['credit.rating']))
X = credit[feature_columns]
y = credit['credit.rating']

# train-test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify =y, 
test_size =0.3, random_state =1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(700, 20)
(300, 20)
(700,)
(300,)

# modeling
from sklearn.tree import DecisionTreeClassifier
clf=DecisionTreeClassifier(max_depth =5)
clf.fit(x_train, y_train)

# 평가
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
pred=clf.predict(x_test)
test_cm=confusion_matrix(y_test, pred)
test_acc=accuracy_score(y_test, pred)
test_prc=precision_score(y_test, pred)
test_rcll=recall_score(y_test, pred)
test_f1=f1_score(y_test, pred)
print(test_cm)
print('\n')
print('정확도\t{}%'.format(round(test_acc *100,2)))
print('정밀도\t{}%'.format(round(test_prc *100,2)))
print('재현율\t{}%'.format(round(test_rcll *100,2)))

[[ 28  62]
 [ 27 183]]

정확도	70.33%
정밀도	74.69%
재현율	87.14%

# 평가표
from sklearn.metrics import classification_report
report = classification_report(y_test, pred)
print(report)

              precision    recall  f1-score   support

           0       0.51      0.31      0.39        90
           1       0.75      0.87      0.80       210

    accuracy                           0.70       300
   macro avg       0.63      0.59      0.60       300
weighted avg       0.68      0.70      0.68       300

# roc
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()

# auc
from sklearn.metrics import roc_auc_score
roc_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
print("ROC_AUC_score : ", roc_auc)

ROC_AUC_score :  0.717857142857143

# feature importance
importances = clf.feature_importances_
column_nm = pd.DataFrame(X.columns)
feature_importances = pd.concat([column_nm, pd.DataFrame(importances)], axis=1)
feature_importances.columns = ['feature_nm', 'importances']
print(feature_importances)

                        feature_nm  importances
                account.balance     0.263282
                            age     0.112494
                 apartment.type     0.021665
                   bank.credits     0.000000
                  credit.amount     0.095584
         credit.duration.months     0.187908
                 credit.purpose     0.059083
                 current.assets     0.000000
                     dependents     0.000000
            employment.duration     0.000000
                foreign.worker     0.000000
                     guarantor     0.011790
              installment.rate     0.000000
                marital.status     0.016325
                    occupation     0.000000
                 other.credits     0.034003
previous.credit.payment.status     0.123825
            residence.duration     0.020960
                       savings     0.053080
                     telephone     0.000000

# feature importance 시각화
import os
os.environ["PATH"] += os.pathsep + '/path/to/graphviz/bin'

import numpy as np
feature_names = feature_columns
target_names = np.array(['0', '1'])
import pydot
import pydotplus
import graphviz
from sklearn.tree import export_graphviz
dt_dot_data = export_graphviz(clf, feature_names = feature_names,
 class_names = target_names,
 filled=True, rounded =True,
 special_characters=True)
dt_graph=pydotplus.graph_from_dot_data(dt_dot_data)
from IPython.display import Image
Image(dt_graph.create_png())

3. Decision Tree Regressior

3.1 Parameters

class sklearn.tree.DecisionTreeRegressor(*, criterion=‘squared_error’, splitter=‘best’, 
max_depth=None, min_samples_split=2, min_samples_leaf=1, 
min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, 
min_impurity_decrease=0.0, ccp_alpha=0.0) 

classifier와 동일

3.2 Attribute

classifier와 동일

3.3 Methods

classifier와 동일

3.4 Implementation

😗 데이터 불러오기

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

np.random.seed(0)
X = np.sort(5 * np.random.rand(400, 1), axis =0)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y = np.sin(X).ravel()
#노이즈 추가하기
y[::1] +=1 * (0.5 - np.random.rand(400))
plt.scatter(X, y, s=20, edgecolor ="black", c ="darkorange", label="data")

# train-test split
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X,y,train_size =0.7, 
random_state =1)
print(train_x.shape, test_x.shape, train_y.shape, test_y.shape)

(280, 1) (120, 1) (280,) (120,)

# modeling
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)

from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_squared_error
import pandas as pd
import numpy as np
y_1 = regr_1.fit(train_x, train_y).predict(test_x)
y_2 = regr_2.fit(train_x, train_y).predict(test_x)
preds = [y_1, y_2]
weights = ["max depth = 2", "max depth = 5"]
evls = ['mse', 'rmse', 'mae']
results=pd.DataFrame(index =weights,columns =evls)
for pred, nm in zip(preds, weights):
 mse = mean_squared_error(test_y, pred)
 mae = mean_absolute_error(test_y, pred)
 rmse = np.sqrt(mse)
 
 results.loc[nm]['mse']=round(mse,2)
 results.loc[nm]['rmse']=round(rmse,2)
 results.loc[nm]['mae']=round(mae,2)
results

X_test = np.sort(5 * np.random.rand(40, 1), axis=0)
regrs=[regr_1, regr_2]
depths=["max depth = 2", "max depth = 5"]
model_color=["m", "c"]
fig, axes = plt.subplots(nrows =1, ncols =2, sharey =True, figsize =(13, 5))
for ix, regr in enumerate(regrs):
 pred = regr.fit(X,y).predict(X_test)
 r2 = regr.score(X_test, pred)
 mae=mean_absolute_error(X_test, pred)
 axes[ix].plot(X_test,
                pred,
                color=model_color[ix],
                label="{}".format(depths[ix])
                )
 axes[ix].scatter(X, y, 
                s=20, 
                edgecolor="gray", 
                c="darkorange", 
                label="data"
                )
 axes[ix].legend(loc="upper right",
                ncol=1,
                fancybox=True,
                shadow=True
                )
 axes[ix].set_title("R2 : {r} , MAE : {m}".format(r =round(r2,3), m=round(mae, 3)))
fig.text(0.5, 0.04, "data", ha ="center", va ="center")
fig.text(0.06, 0.5, "target", ha ="center", va ="center", rotation ="vertical")
fig.suptitle("Decision Tree Regression", fontsize =14)
plt.show()

Twitter Facebook LinkedIn

sigi

ADP 실기 9장 Decision Tree

1. Decision Tree

2. Decision Tree Classifier

2.1 Parameters

2.2 Attributes

2.3 Method

2.4 Implementation

3. Decision Tree Regressior

3.1 Parameters

3.2 Attribute

3.3 Methods

3.4 Implementation

공유하기

댓글남기기

참고

Llama 3.1 한국어 Finetuning

LLM의 활용 방법

k8s (3) Pod 정리

Continuous distribution (3) Gamma Distribution