
This post introduces ensemble methods.

1. Ensemble

1.1 Concept

A technique that combines machine learning models into a single strong model, overcoming the weaknesses of a single decision tree

1.2 Bootstrap

  • A form of random sampling
  • Simple random sampling with replacement (duplicates allowed); a short sketch follows
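
A minimal sketch of a bootstrap sample with NumPy (sizes are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
data = np.arange(n)

# Simple random sampling with replacement: the same observation may be drawn twice
boot = rng.choice(data, size=n, replace=True)

# On average only about 63.2% (= 1 - 1/e) of the observations appear in the sample
print("unique fraction:", len(np.unique(boot)) / n)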

1.3 Bagging

  • Treat the given data as the population and generate multiple bootstrap samples
  • Fit a prediction model to each bootstrap sample, then combine the results
  • Voting: grow each tree to its full size (no pruning; overfitting is not a concern at this stage), then take a majority vote (see the sketch after this list)
  • Characteristics
    • Each bootstrap sample can be trained in parallel
    • Yields an ensemble model with lower variance
  • OOB: on average only about 63% of the observations end up in each bootstrap sample, so the remaining ~37% can be used for validation
    • Note that the held-out observations differ from model to model
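
A sketch of bagging by hand on synthetic data: one fully grown tree per bootstrap sample, combined by majority vote (sklearn's BaggingClassifier below does this for you):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))                   # bootstrap sample
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))   # grown without pruning

# Majority vote over the individual trees
votes = np.stack([t.predict(X) for t in trees])
pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("train accuracy:", (pred == y).mean())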

1.4 Boosting

  • Combines weak learners (small, shallow trees) into a strong model
  • Training proceeds sequentially rather than in parallel; a rough sketch of the reweighting loop follows this list
  • Low error on the training set, but with a risk of overfitting
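
A rough sketch of the sequential reweighting idea behind AdaBoost, on synthetic data; this is illustrative, not sklearn's exact implementation:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
w = np.full(len(X), 1 / len(X))               # start from uniform sample weights
stumps, alphas = [], []
for _ in range(10):
    # Weak learner: a depth-1 tree trained on the current weights
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    miss = stump.predict(X) != y
    err = np.clip(w[miss].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)     # say of this weak learner in the vote
    w *= np.exp(alpha * np.where(miss, 1.0, -1.0))   # upweight misclassified samples
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Strong model: weighted vote of the weak learners (labels mapped to ±1)
agg = sum(a * np.where(s.predict(X) == 1, 1.0, -1.0) for s, a in zip(stumps, alphas))
print("train accuracy:", ((agg > 0).astype(int) == y).mean())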

1.5 Random Forest

  • Injects more randomness than bagging or boosting to build weak models, then combines them linearly (see the sketch below)
  • Can model thousands of variables without eliminating any and still achieve good accuracy
  • The resulting model is hard to interpret
  • With many input variables, performs comparably to or better than bagging and boosting
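
The extra randomness comes from considering only a random subset of features at every split (max_features), on top of the bootstrap; a minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
# Each tree sees a bootstrap sample; each split considers only sqrt(20) ≈ 4 features
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0).fit(X, y)
print("train accuracy:", rf.score(X, y))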

2. Bagging Classifier

2.1 Parameters

class sklearn.ensemble.BaggingClassifier(estimator=None, n_estimators=10, *, 
max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, 
oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)
  • estimator: the classifier to bag (defaults to a single decision tree)
  • n_estimators: number of base models
  • max_samples: fraction of samples drawn for each base model (0-1)
  • max_features: fraction of features (columns) drawn for each base model (0-1)
  • oob_score: whether to use out-of-bag samples to estimate the generalization error (example below)
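
For example (parameter values chosen only for illustration):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base classifier (the default is also a tree)
    n_estimators=50,                     # number of base models
    max_samples=0.8,                     # each model sees 80% of the rows
    max_features=0.5,                    # each model sees 50% of the columns
    oob_score=True,                      # estimate generalization error from OOB samples
)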

2.2 Attributes

  • oob_score_: score of the training data obtained using the OOB samples; only available when the oob_score parameter is set to True

2.3 Methods

  • fit(X, y)
  • predict(X)
  • predict_proba(X)
  • score(X, y): returns the mean accuracy of the predictions, since this is a classifier

2.4 Implementation

😗 Load the data

import pandas as pd
breast = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/breast-cancer.csv")
breast

# visualize the target
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure()
sns.countplot(x='diagnosis', data=breast)

# visualize the target-feature relationship
# check how area_mean and texture_mean relate to diagnosis
sns.relplot(x='area_mean', y="texture_mean", hue='diagnosis', data=breast)

# encode the categorical variable (M=1, B=0)
import numpy as np
from sklearn.model_selection import train_test_split
breast["diagnosis"] = np.where(breast["diagnosis"]=="M", 1, 0)

# set features and target
features = ["area_mean", "area_worst"]
X = breast[features]
y = breast["diagnosis"]

# train_test split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=1)
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
(398, 2) (171, 2)
(398,) (171,)
# modeling
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
clf = BaggingClassifier(estimator=DecisionTreeClassifier())
pred = clf.fit(x_train, y_train).predict(x_test)
print("Accuracy Score : ", clf.score(x_test, y_test))
Accuracy Score :  0.9239766081871345
# evaluation
from sklearn.metrics import confusion_matrix 
pd.DataFrame(confusion_matrix(y_test, pred),
 index=['True[0]', 'True[1]'],
 columns=['Pred[0]','Pred[1]'])

# ROC Curve, AUC Score
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
roc_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()

print("ROC_AUC_score : ", roc_auc)

ROC_AUC_score :  0.9324620327102803
# oob score
clf_oob = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=50,
                            oob_score=True)
oob = clf_oob.fit(X, y).oob_score_
print(oob)
0.9244288224956063

3. Bagging Regressor

3.1 Parameters

class sklearn.ensemble.BaggingRegressor(estimator=None, n_estimators=10, *, 
max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, 
oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0) 
  • Same as the classifier

3.2 Attributes

  • oob_score_: same as the classifier

3.3 Methods

  • Same as the classifier; a minimal regression sketch follows
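
Since the API mirrors the classifier, a quick sketch on synthetic data:

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
reg = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50)
print("R^2:", reg.fit(X, y).score(X, y))  # score() returns R^2 for regressors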

3.4 Implementation

😗 Load the data

import pandas as pd
car = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/CarPrice_Assignment.csv")
car.info()

# set target and features
car_num = car.select_dtypes(['number'])
features = list(car_num.columns.difference(['car_ID', 'symboling', 'price']))
X = car_num[features]
y = car_num['price']
print(X.shape, y.shape)
(205, 13) (205,)
# oob score
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
reg = BaggingRegressor(estimator=DecisionTreeRegressor(),
                       n_estimators=50,
                       oob_score=True)
reg = reg.fit(X, y)
reg.oob_score_
0.9224681669421886

4. AdaBoost Classifier

4.1 Parameters

class sklearn.ensemble.AdaBoostClassifier(estimator=None, *, n_estimators=50, 
learning_rate=1.0, algorithm='SAMME.R', random_state=None) 
  • estimator: the base model; a decision tree if None
  • n_estimators: stopping condition (maximum number of models); see the sketch below
  • learning_rate: weight applied to each model at every boosting iteration
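
n_estimators is an upper bound: boosting may stop earlier if a perfect fit is reached. A sketch using staged_predict to watch accuracy as models are added (synthetic data, values illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, random_state=0)
clf = AdaBoostClassifier(n_estimators=50, learning_rate=0.5).fit(X, y)
# training accuracy after every 10th boosting round
for i, stage_pred in enumerate(clf.staged_predict(X), start=1):
    if i % 10 == 0:
        print(i, accuracy_score(y, stage_pred))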

4.2 Attributes

  • feature_importances_: impurity-based feature importances

4.3 Methods

  • fit(X,y)
  • predict(X)
  • predict_proba(X)
  • score(X,y): mean accuracy

4.4 Implementation

😗 Load the data

import pandas as pd
breast = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/breast-cancer.csv")
import numpy as np
from sklearn.model_selection import train_test_split

# encode the categorical variable
breast["diagnosis"] = np.where(breast["diagnosis"]=="M", 1, 0)
features = ["area_mean", "texture_mean"]

# set features and target
X = breast[features]
y = breast["diagnosis"]

# data split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
(398, 2) (171, 2)
(398,) (171,)
# modeling & evaluation
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(estimator=None)
pred = clf.fit(x_train, y_train).predict(x_test)
print("Accuracy : ", clf.score(x_test, y_test))
Accuracy :  0.9122807017543859
# confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
pred = clf.predict(x_test)
test_cm = confusion_matrix(y_test, pred)
test_acc = accuracy_score(y_test, pred)
test_prc = precision_score(y_test, pred)
test_rcll = recall_score(y_test, pred)
test_f1 = f1_score(y_test, pred)
print(test_cm)
print('Accuracy\t{}%'.format(round(test_acc * 100, 2)))
print('Precision\t{}%'.format(round(test_prc * 100, 2)))
print('Recall\t{}%'.format(round(test_rcll * 100, 2)))
[[102   5]
 [ 10  54]]
Accuracy	91.23%
Precision	91.53%
Recall	84.38%
# ROC, AUC
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
roc_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()

print("ROC_AUC_score : ", roc_auc)

ROC_AUC_score :  0.9444363317757009
# feature importances
importances = clf.feature_importances_
column_nm = pd.DataFrame(["area_mean", "texture_mean"])
feature_importances = pd.concat([column_nm,
 pd.DataFrame(importances)],
 axis=1)
feature_importances.columns = ['feature_nm', 'importances']
print(feature_importances)
     feature_nm  importances
0     area_mean         0.56
1  texture_mean         0.44
# visualize feature importances
f = features
xtick_label_position = list(range(len(f)))
plt.xticks(xtick_label_position, f)
plt.bar([x for x in range(len(importances))], importances)

5. AdaBoost Regressor

5.1 Parameters

class sklearn.ensemble.AdaBoostRegressor(estimator=None, *, 
n_estimators=50, learning_rate=1.0, loss='linear', random_state=None)
  • Same as the classifier, plus loss ('linear', 'square', 'exponential'): the loss function used to update sample weights after each boosting iteration; a quick comparison follows
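
A quick sketch comparing the three loss options on synthetic data:

from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor

X, y = make_regression(n_samples=200, noise=10, random_state=0)
for loss in ("linear", "square", "exponential"):
    reg = AdaBoostRegressor(loss=loss, random_state=0).fit(X, y)
    print(loss, round(reg.score(X, y), 3))  # R^2 on the training data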

5.2 Attributes

  • Same as the classifier

5.3 Methods

  • Same as the classifier

5.4 Implementation

😗 Load the data

car = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/CarPrice_Assignment.csv")
# preprocessing
car_num = car.select_dtypes(['number'])

# set features & target
features = list(car_num.columns.difference(['car_ID', 'symboling', 'price']))
X = car_num[features]
y = car_num['price']

# train_test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(143, 13)
(62, 13)
(143,)
(62,)
# modeling
from sklearn.ensemble import AdaBoostRegressor
reg = AdaBoostRegressor(estimator=None)
pred = reg.fit(x_train, y_train).predict(x_test)

# evaluation
from sklearn.metrics import mean_squared_error, mean_absolute_error
mse = mean_squared_error(y_test, pred)
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mse)
r2 = reg.score(x_test, y_test)  # score() returns R^2 for regressors, not accuracy
print('MSE\t{}'.format(round(mse, 3)))
print('MAE\t{}'.format(round(mae, 3)))
print('RMSE\t{}'.format(round(rmse, 3)))
print('R2\t{}%'.format(round(r2 * 100, 3)))
MSE	6047513.193
MAE	1847.222
RMSE	2459.169
R2	89.983%
# feature importance
importances = reg.feature_importances_
column_nm = pd.DataFrame(features)
feature_importances = pd.concat([column_nm,
 pd.DataFrame(importances)],
 axis=1)
feature_importances.columns = ['feature_nm', 'importances']
print(feature_importances)

# visualize feature importances
n_features = x_train.shape[1]
importances = reg.feature_importances_
column_nm = features
plt.barh(range(n_features), importances, align='center')
plt.yticks(np.arange(n_features), column_nm)
plt.xlabel("feature importances")
plt.ylabel("feature")
plt.ylim(-1, n_features)
plt.show()

6. Random Forest Classifier

6.1 Parameters

class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', 
max_depth=None, min_samples_split=2, min_samples_leaf=1, 
min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, 
min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, 
random_state=None, verbose=0, warm_start=False, class_weight=None, 
ccp_alpha=0.0, max_samples=None)
  • n_estimators: number of decision trees; see the sketch below
  • criterion: impurity measure used to evaluate splits (gini, entropy)
  • max_depth: maximum depth of each tree; if None, nodes are expanded until every leaf is pure or contains fewer than min_samples_split samples
  • min_samples_split: minimum number of samples required to split an internal node
  • min_samples_leaf: minimum number of samples required at a leaf node
  • max_leaf_nodes: maximum number of leaf nodes; unlimited if None
  • bootstrap: if False, the whole dataset is used to build every tree
  • oob_score: whether to use out-of-bag samples to estimate the generalization score
  • ccp_alpha: complexity parameter for minimal cost-complexity pruning; the subtree with the largest cost-complexity that is still smaller than ccp_alpha is chosen; with the default of 0.0 no pruning is performed
  • min_impurity_decrease: a node is split only if the split decreases the impurity by at least this value
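
For example, combining several of these controls (values chosen only for illustration):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=300,            # number of trees
    criterion="entropy",         # impurity measure for splits
    max_depth=8,                 # cap the depth of each tree
    min_samples_split=10,        # need at least 10 samples to split a node
    min_impurity_decrease=1e-4,  # split only if impurity drops at least this much
    oob_score=True,              # estimate the generalization score from OOB samples
    n_jobs=-1,                   # build trees in parallel
    random_state=0,
)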

6.2 Attributes

  • feature_importances_: feature importances
  • oob_score_

6.3 Methods

  • fit(X,y)
  • predict(X)
  • predict_proba(X)
  • score(X,y)

6.4 Implementation

😗 Load the data

import pandas as pd
breast = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/breast-cancer.csv")
import numpy as np
from sklearn.model_selection import train_test_split
breast["diagnosis"] = np.where(breast["diagnosis"]=="M", 1, 0)
features = ["area_mean", "texture_mean"]
X = breast[features]
y = breast["diagnosis"]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=1)
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
(398, 2) (171, 2)
(398,) (171,)
from sklearn.ensemble import RandomForestClassifier 
clf = RandomForestClassifier(n_estimators=100, min_samples_split=5)
pred = clf.fit(x_train, y_train).predict(x_test)
print("Accuracy : ", clf.score(x_test, y_test))
Accuracy :  0.9005847953216374
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
pred = clf.predict(x_test)
test_cm = confusion_matrix(y_test, pred)
test_acc = accuracy_score(y_test, pred)
test_prc = precision_score(y_test, pred)
test_rcll = recall_score(y_test, pred)
test_f1 = f1_score(y_test, pred)
print(test_cm)
print('Accuracy\t{}%'.format(round(test_acc * 100, 2)))
print('Precision\t{}%'.format(round(test_prc * 100, 2)))
print('Recall\t{}%'.format(round(test_rcll * 100, 2)))
[[103   4]
 [ 13  51]]
Accuracy	90.06%
Precision	92.73%
Recall	79.69%
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
roc_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()

print("ROC_AUC_score : ", roc_auc)

importances = clf.feature_importances_
column_nm = pd.DataFrame(["area_mean", "texture_mean"])
feature_importances = pd.concat([column_nm,
 pd.DataFrame(importances)],
 axis=1)
feature_importances.columns = ['feature_nm', 'importances']
print(feature_importances)
     feature_nm  importances
0     area_mean     0.687528
1  texture_mean     0.312472
f = features
xtick_label_position = list(range(len(f)))
plt.xticks(xtick_label_position, f)
plt.bar([x for x in range(len(importances))], importances)

7. Random Forest Regressor

7.1 Parameters

class sklearn.ensemble.RandomForestRegressor(n_estimators=100, *, 
criterion='squared_error', max_depth=None, min_samples_split=2, 
min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', 
max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, 
oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, 
ccp_alpha=0.0, max_samples=None)
  • Same as the classifier

7.2 Attributes

  • Same as the classifier

7.3 Methods

  • Same as the classifier

7.4 Implementation

😗 Load the data

car = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/CarPrice_Assignment.csv")
car_num = car.select_dtypes(['number'])
features = list(car_num.columns.difference(['car_ID', 'symboling', 'price']))
X = car_num[features]
y = car_num['price']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(143, 13)
(62, 13)
(143,)
(62,)
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor()
pred = reg.fit(x_train, y_train).predict(x_test)
from sklearn.metrics import mean_squared_error, mean_absolute_error
mse = mean_squared_error(y_test, pred)
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mse)
r2 = reg.score(x_test, y_test)  # score() returns R^2 for regressors, not accuracy
print('MSE\t{}'.format(round(mse, 3)))
print('MAE\t{}'.format(round(mae, 3)))
print('RMSE\t{}'.format(round(rmse, 3)))
print('R2\t{}%'.format(round(r2 * 100, 3)))
MSE	4171875.557
MAE	1333.243
RMSE	2042.517
R2	93.09%
importances = reg.feature_importances_
column_nm = pd.DataFrame(features)
feature_importances = pd.concat([column_nm,
 pd.DataFrame(importances)],
 axis=1)
feature_importances.columns = ['feature_nm', 'importances']
print(feature_importances)
          feature_nm  importances
0          boreratio     0.005480
1          carheight     0.003741
2          carlength     0.009772
3           carwidth     0.017285
4            citympg     0.005848
5   compressionratio     0.003881
6         curbweight     0.183016
7         enginesize     0.663279
8         highwaympg     0.059726
9         horsepower     0.024374
10           peakrpm     0.006611
11            stroke     0.003564
12         wheelbase     0.013423
n_features = x_train.shape[1]
importances = reg.feature_importances_
column_nm = features
plt.barh(range(n_features), importances, align='center')
plt.yticks(np.arange(n_features), column_nm)
plt.xlabel("feature importances")
plt.ylabel("feature")
plt.ylim(-1, n_features)
plt.show()
