ADP 실기 9장 Decision Tree
본 게시물은 Decision Tree에 대해 소개한다.
1. Decision Tree
데이터를 분류하고 예측하기 위해 특정 기준에 따라 ‘Yes/No’로 대답할 수 있는 분기를 두도록 학습
- 선형성, 정규성 등 가정이 필요하지 않아 전처리 과정에서 모델 성능 영향이 크지 않음
- 연속형, 범주형 모두 사용 가능
- 분류 규칙 명확하여 해석이 쉬움
2. Decision Tree Classifier
2.1 Parameters
class sklearn.tree.DecisionTreeClassifier(*, criterion=‘gini’, splitter=‘best’,
max_depth=None, min_samples_split=2, min_samples_leaf=1,
min_weight_fraction_leaf=0.0, max_features=None, random_state=None,
max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None,
ccp_alpha=0.0)
- criterion: 노드 분할시 사용 함수
- max_depth: 트리의 최대 깊이
- min_samples_split: 노드 분할시 필요한 최소 데이터 수
- min_samples_leaf: 리프노드에 있어야할 최소 데이터 수
- max_features: 노드 분할시 사용할 feature의 수, None or auto일 경우 원본 데이터 개수를 사용
- max_leaf_node: 리프노드의 최대 데이터 수
- max_impurity_decrease: 노드가 분할되는 조건으로 해당 값보다 작거나 같은 수준으로 복잡도가 감소할 시 분할
- ccp_alpha: pruning에 사용되는 parameter로 ccp-alpha보다 작은 비용 - 복잡성을 가진 서브트리 중 가장 비용 - 복잡성이 큰 트리를 선택함, 0일경우 prune하지 않음
2.2 Attributes
- feature_importances_: 변수 중요도를 반화
2.3 Method
- cost_complexity_pruning_path
- X : 샘플데이터. 2차원 array 형태로 입력
- y : 타깃데이터. (n_samples,) 또는 (n_samples, n_targets) 형태로 입력
- 딕셔너리 형태의 ccp_path가 반환. ccp_alphas는 가지치기 동안의 서브트리에 대한 effective alpha값이며, impurities는 ccp_alpha값에 상응하는 서브트리 리프들의 불순도 합
2.4 Implementation
😗 데이터 불러오기
import pandas as pd
credit = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/credit_final.csv")
credit
# feature & target 생성
feature_columns = list(credit.columns.difference(['credit.rating']))
X = credit[feature_columns]
y = credit['credit.rating']
# train-test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify =y,
test_size =0.3, random_state =1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(700, 20)
(300, 20)
(700,)
(300,)
# modeling
from sklearn.tree import DecisionTreeClassifier
clf=DecisionTreeClassifier(max_depth =5)
clf.fit(x_train, y_train)
# 평가
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
pred=clf.predict(x_test)
test_cm=confusion_matrix(y_test, pred)
test_acc=accuracy_score(y_test, pred)
test_prc=precision_score(y_test, pred)
test_rcll=recall_score(y_test, pred)
test_f1=f1_score(y_test, pred)
print(test_cm)
print('\n')
print('정확도\t{}%'.format(round(test_acc *100,2)))
print('정밀도\t{}%'.format(round(test_prc *100,2)))
print('재현율\t{}%'.format(round(test_rcll *100,2)))
[[ 28 62]
[ 27 183]]
정확도 70.33%
정밀도 74.69%
재현율 87.14%
# 평가표
from sklearn.metrics import classification_report
report = classification_report(y_test, pred)
print(report)
precision recall f1-score support
0 0.51 0.31 0.39 90
1 0.75 0.87 0.80 210
accuracy 0.70 300
macro avg 0.63 0.59 0.60 300
weighted avg 0.68 0.70 0.68 300
# roc
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()
# auc
from sklearn.metrics import roc_auc_score
roc_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
print("ROC_AUC_score : ", roc_auc)
ROC_AUC_score : 0.717857142857143
# feature importance
importances = clf.feature_importances_
column_nm = pd.DataFrame(X.columns)
feature_importances = pd.concat([column_nm, pd.DataFrame(importances)], axis=1)
feature_importances.columns = ['feature_nm', 'importances']
print(feature_importances)
feature_nm importances
0 account.balance 0.263282
1 age 0.112494
2 apartment.type 0.021665
3 bank.credits 0.000000
4 credit.amount 0.095584
5 credit.duration.months 0.187908
6 credit.purpose 0.059083
7 current.assets 0.000000
8 dependents 0.000000
9 employment.duration 0.000000
10 foreign.worker 0.000000
11 guarantor 0.011790
12 installment.rate 0.000000
13 marital.status 0.016325
14 occupation 0.000000
15 other.credits 0.034003
16 previous.credit.payment.status 0.123825
17 residence.duration 0.020960
18 savings 0.053080
19 telephone 0.000000
# feature importance 시각화
import os
os.environ["PATH"] += os.pathsep + '/path/to/graphviz/bin'
import numpy as np
feature_names = feature_columns
target_names = np.array(['0', '1'])
import pydot
import pydotplus
import graphviz
from sklearn.tree import export_graphviz
dt_dot_data = export_graphviz(clf, feature_names = feature_names,
class_names = target_names,
filled=True, rounded =True,
special_characters=True)
dt_graph=pydotplus.graph_from_dot_data(dt_dot_data)
from IPython.display import Image
Image(dt_graph.create_png())
3. Decision Tree Regressior
3.1 Parameters
class sklearn.tree.DecisionTreeRegressor(*, criterion=‘squared_error’, splitter=‘best’,
max_depth=None, min_samples_split=2, min_samples_leaf=1,
min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, ccp_alpha=0.0)
- classifier와 동일
3.2 Attribute
- classifier와 동일
3.3 Methods
- classifier와 동일
3.4 Implementation
😗 데이터 불러오기
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
np.random.seed(0)
X = np.sort(5 * np.random.rand(400, 1), axis =0)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y = np.sin(X).ravel()
#노이즈 추가하기
y[::1] +=1 * (0.5 - np.random.rand(400))
plt.scatter(X, y, s=20, edgecolor ="black", c ="darkorange", label="data")
# train-test split
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X,y,train_size =0.7,
random_state =1)
print(train_x.shape, test_x.shape, train_y.shape, test_y.shape)
(280, 1) (120, 1) (280,) (120,)
# modeling
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_squared_error
import pandas as pd
import numpy as np
y_1 = regr_1.fit(train_x, train_y).predict(test_x)
y_2 = regr_2.fit(train_x, train_y).predict(test_x)
preds = [y_1, y_2]
weights = ["max depth = 2", "max depth = 5"]
evls = ['mse', 'rmse', 'mae']
results=pd.DataFrame(index =weights,columns =evls)
for pred, nm in zip(preds, weights):
mse = mean_squared_error(test_y, pred)
mae = mean_absolute_error(test_y, pred)
rmse = np.sqrt(mse)
results.loc[nm]['mse']=round(mse,2)
results.loc[nm]['rmse']=round(rmse,2)
results.loc[nm]['mae']=round(mae,2)
results
X_test = np.sort(5 * np.random.rand(40, 1), axis=0)
regrs=[regr_1, regr_2]
depths=["max depth = 2", "max depth = 5"]
model_color=["m", "c"]
fig, axes = plt.subplots(nrows =1, ncols =2, sharey =True, figsize =(13, 5))
for ix, regr in enumerate(regrs):
pred = regr.fit(X,y).predict(X_test)
r2 = regr.score(X_test, pred)
mae=mean_absolute_error(X_test, pred)
axes[ix].plot(X_test,
pred,
color=model_color[ix],
label="{}".format(depths[ix])
)
axes[ix].scatter(X, y,
s=20,
edgecolor="gray",
c="darkorange",
label="data"
)
axes[ix].legend(loc="upper right",
ncol=1,
fancybox=True,
shadow=True
)
axes[ix].set_title("R2 : {r} , MAE : {m}".format(r =round(r2,3), m=round(mae, 3)))
fig.text(0.5, 0.04, "data", ha ="center", va ="center")
fig.text(0.06, 0.5, "target", ha ="center", va ="center", rotation ="vertical")
fig.suptitle("Decision Tree Regression", fontsize =14)
plt.show()
댓글남기기