ADP Practical Exam, Chapter 10: Ensemble
This post introduces ensemble methods.
1. Ensemble
1.1 Concept
A method that combines machine learning models into a single strong model in order to overcome the weaknesses of a single decision tree
1.2 Bootstrap
- A form of random sampling
- Simple random sampling with replacement (duplicates allowed); see the sketch below
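A minimal sketch of what a bootstrap sample looks like, using numpy (the toy array and the seed are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # hypothetical toy dataset
# simple random sampling with replacement: same length as the data, duplicates allowed
boot = rng.choice(data, size=len(data), replace=True)
print(boot)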
1.3 Bagging
- Treat the given data as the population and generate multiple bootstrap samples
- Fit a prediction model to each bootstrap sample, then combine the results
- Voting: grow each tree to its maximum size (no pruning, so per-tree overfitting is not addressed) and take a majority vote
- Characteristics
  - Each bootstrap sample can be processed in parallel
  - Yields an ensemble model with lower variance
  - OOB: on average only about 63% of the observations are drawn into each bootstrap sample, so the remaining ~37% (out-of-bag) can be used for validation; the figure follows from (1 - 1/n)^n ≈ e^(-1) ≈ 0.368 (see the sketch below)
    - Note that the held-out (OOB) observations differ from model to model
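A quick empirical check of the ~63% figure (illustrative sketch; the sample sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(42)
n = 1000
fractions = []
for _ in range(100):  # draw 100 bootstrap samples of size n
    boot = rng.choice(n, size=n, replace=True)
    fractions.append(len(np.unique(boot)) / n)  # share of distinct rows in the sample
print(np.mean(fractions))  # close to 1 - e^(-1) ≈ 0.632; the rest is out-of-bag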
1.4 Boosting
- Combines weak learners (small, shallow trees) into a strong model
- Trains sequentially rather than in parallel (sketched below)
- Achieves a low error on the training data but carries a risk of overfitting
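To make the sequential idea concrete, here is a minimal AdaBoost-style sketch with decision stumps; this is a toy illustration of the reweighting logic, not scikit-learn's internal implementation:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
y_pm = np.where(y == 1, 1, -1)    # recode labels to {-1, +1}
w = np.full(len(y), 1 / len(y))   # start from uniform sample weights
stumps, alphas = [], []
for _ in range(10):               # train 10 weak learners, one after another
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y_pm, sample_weight=w)
    pred = stump.predict(X)
    err = max(w[pred != y_pm].sum(), 1e-10)  # weighted error of this round
    alpha = 0.5 * np.log((1 - err) / err)    # accurate stumps get a larger vote
    w *= np.exp(-alpha * y_pm * pred)        # up-weight the misclassified samples
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)
# final prediction: sign of the weighted vote of all stumps
agg = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("train accuracy:", (np.sign(agg) == y_pm).mean())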
1.5 Random Forest
- Injects more randomness than bagging or boosting, builds many weak models, and combines them linearly
- Can model thousands of variables without prior elimination and still achieve good accuracy
- The resulting model is hard to interpret
- With many input variables it performs comparably to, or better than, bagging and boosting (see the comparison sketch below)
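As a quick sanity check of that claim, a sketch comparing a single tree against a forest with 5-fold cross-validation (sklearn's built-in breast-cancer data stands in for any wide dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
rf_acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
print("single tree:", round(tree_acc, 3), "random forest:", round(rf_acc, 3))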
2. Bagging Classifier
2.1 Parameters
class sklearn.ensemble.BaggingClassifier(estimator=None, n_estimators=10, *,
max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False,
oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)
- estimator: the base classifier to bag (default: a single decision tree)
- n_estimators: number of base models
- max_samples: fraction of samples drawn for each base model (0-1)
- max_features: fraction of columns drawn for each base model (0-1)
- oob_score: whether to use out-of-bag samples to estimate the generalization error (example below)
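For instance, a bagging ensemble where each of 100 trees sees 80% of the rows and half of the columns could be configured as below (the values are arbitrary; assumes a scikit-learn version where the parameter is named estimator, as in the signature above):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base classifier (also the default)
    n_estimators=100,                    # number of base models
    max_samples=0.8,                     # each model is trained on 80% of the rows
    max_features=0.5,                    # ... and on 50% of the columns
    oob_score=True,                      # estimate generalization error from OOB samples
    random_state=0,
)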
2.2 Attributes
- oob_score_: score on the training data obtained using the OOB samples; requires oob_score=True
2.3 Methods
- fit(X, y)
- predict(X)
- predict_proba(X)
- score(X, y): returns the mean accuracy of the predictions (this is a classifier)
2.4 Implementation
Load the data
import pandas as pd
breast = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/breast-cancer.csv")
breast
# visualize the target distribution
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure()
sns.countplot(x='diagnosis', data=breast)
# visualize the target-feature relationship:
# how do area_mean and texture_mean relate to diagnosis?
sns.relplot(x='area_mean', y='texture_mean', hue='diagnosis', data=breast)
# encode the categorical variable (M -> 1, otherwise 0)
import numpy as np
from sklearn.model_selection import train_test_split
breast["diagnosis"] = np.where(breast["diagnosis"]=="M", 1, 0)
# set features and target
features = ["area_mean", "area_worst"]
X = breast[features]
y = breast["diagnosis"]
# train/test split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=1)
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
(398, 2) (171, 2)
(398,) (171,)
# modeling
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
clf = BaggingClassifier(estimator=DecisionTreeClassifier())
pred = clf.fit(x_train, y_train).predict(x_test)
print("Accuracy Score : ", clf.score(x_test, y_test))
Accuracy Score : 0.9239766081871345
# evaluation
from sklearn.metrics import confusion_matrix
pd.DataFrame(confusion_matrix(y_test, pred),
index=['True[0]', 'True[1]'],
columns=['Pred[0]','Pred[1]'])
# ROC Curve, AUC Score
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
roc_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()
print("ROC_AUC_score : ", roc_auc)
ROC_AUC_score : 0.9324620327102803
# OOB score
clf_oob = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=50,
                            oob_score=True)
oob = clf_oob.fit(X, y).oob_score_
print(oob)
0.9244288224956063
3. Bagging Regressor
3.1 Parameters
class sklearn.ensemble.BaggingRegressor(estimator=None, n_estimators=10, *,
max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False,
oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)
- Same as the classifier
3.2 Attributes
- oob_score_: same as the classifier
3.3 Methods
- Same as the classifier
3.4 Implementation
Load the data
import pandas as pd
car = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/CarPrice_Assignment.csv")
car.info()
# set target and features
car_num = car.select_dtypes(['number'])
features = list(car_num.columns.difference(['car_ID', 'symboling', 'price']))
X = car_num[features]
y = car_num['price']
print(X.shape, y.shape)
(205, 13) (205,)
# OOB score
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
reg = BaggingRegressor(estimator=DecisionTreeRegressor(),
                       n_estimators=50,
                       oob_score=True)
reg = reg.fit(X, y)
reg.oob_score_
0.9224681669421886
4. AdaBoost Classifier
4.1 Parameters
class sklearn.ensemble.AdaBoostClassifier(estimator=None, *, n_estimators=50,
learning_rate=1.0, algorithm='SAMME.R', random_state=None)
- estimator: the base model; if None, a decision tree stump (max_depth=1) is used
- n_estimators: stopping condition (maximum number of models)
- learning_rate: weight applied to each model at every boosting iteration (see the sketch below)
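Because the models are added one at a time, you can watch accuracy evolve round by round; a sketch using the classifier's staged_score method on sklearn's built-in breast-cancer data (the settings here are arbitrary):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = AdaBoostClassifier(n_estimators=50, learning_rate=0.5).fit(x_tr, y_tr)
# staged_score yields the test accuracy after each boosting round
for i, s in enumerate(clf.staged_score(x_te, y_te), start=1):
    if i % 10 == 0:
        print("after", i, "rounds:", round(s, 3))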
4.2 Attributes
- feature_importances_: impurity-based feature importances
4.3 Methods
- fit(X,y)
- predict(X)
- predict_proba(X)
- score(X,y): mean accuracy
4.4 Implementation
Load the data
import pandas as pd
breast = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/breast-cancer.csv")
import numpy as np
from sklearn.model_selection import train_test_split
# encode the categorical variable (M -> 1, otherwise 0)
breast["diagnosis"] = np.where(breast["diagnosis"]=="M", 1, 0)
# set features and target
features = ["area_mean", "texture_mean"]
X = breast[features]
y = breast["diagnosis"]
# train/test split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
(398, 2) (171, 2)
(398,) (171,)
# modeling & evaluation
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(estimator=None)
pred = clf.fit(x_train, y_train).predict(x_test)
print("Accuracy : ", clf.score(x_test, y_test))
Accuracy : 0.9122807017543859
# confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
pred = clf.predict(x_test)
test_cm = confusion_matrix(y_test, pred)
test_acc = accuracy_score(y_test, pred)
test_prc = precision_score(y_test, pred)
test_rcll = recall_score(y_test, pred)
test_f1 = f1_score(y_test, pred)
print(test_cm)
print('Accuracy\t{}%'.format(round(test_acc * 100, 2)))
print('Precision\t{}%'.format(round(test_prc * 100, 2)))
print('Recall\t{}%'.format(round(test_rcll * 100, 2)))
[[102   5]
 [ 10  54]]
Accuracy 91.23%
Precision 91.53%
Recall 84.38%
# ROC curve and AUC
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
roc_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()
print("ROC_AUC_score : ", roc_auc)
ROC_AUC_score : 0.9444363317757009
# feature importances
importances = clf.feature_importances_
column_nm = pd.DataFrame(["area_mean", "texture_mean"])
feature_importances = pd.concat([column_nm,
pd.DataFrame(importances)],
axis=1)
feature_importances.columns = ['feature_nm', 'importances']
print(feature_importances)
feature_nm importances
0 area_mean 0.56
1 texture_mean 0.44
# visualize the feature importances
f = features
xtick_label_position = list(range(len(f)))
plt.xticks(xtick_label_position, f)
plt.bar([x for x in range(len(importances))], importances)
5. AdaBoost Regressor
5.1 Parameters
class sklearn.ensemble.AdaBoostRegressor(estimator=None, *,
n_estimators=50, learning_rate=1.0, loss='linear', random_state=None)
- Same as the classifier, plus loss: the loss function used when updating sample weights ('linear', 'square', 'exponential'); compared below
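A quick sketch comparing the three loss options on synthetic data (everything here is illustrative):

from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, noise=10, random_state=0)
for loss in ("linear", "square", "exponential"):
    r2 = cross_val_score(AdaBoostRegressor(loss=loss, random_state=0), X, y, cv=5).mean()
    print(loss, round(r2, 3))  # mean R^2 under each loss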
5.2 Attributes
- Same as the classifier
5.3 Methods
- Same as the classifier
5.4 Implementation
Load the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
car = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/CarPrice_Assignment.csv")
# preprocessing
car_num = car.select_dtypes(['number'])
# set features and target
features = list(car_num.columns.difference(['car_ID', 'symboling', 'price']))
X = car_num[features]
y = car_num['price']
# train/test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(143, 13)
(62, 13)
(143,)
(62,)
# modeling
from sklearn.ensemble import AdaBoostRegressor
reg = AdaBoostRegressor(estimator=None)
pred = reg.fit(x_train, y_train).predict(x_test)
# evaluation
from sklearn.metrics import mean_squared_error, mean_absolute_error
mse = mean_squared_error(y_test, pred)
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mse)
r2 = reg.score(x_test, y_test)  # score() returns R^2 for regressors, not accuracy
print('MSE\t{}'.format(round(mse, 3)))
print('MAE\t{}'.format(round(mae, 3)))
print('RMSE\t{}'.format(round(rmse, 3)))
print('R2\t{}%'.format(round(r2 * 100, 3)))
MSE 6047513.193
MAE 1847.222
RMSE 2459.169
R2 89.983%
# feature importance
importances = reg.feature_importances_
column_nm = pd.DataFrame(features)
feature_importances = pd.concat([column_nm,
pd.DataFrame(importances)],
axis=1)
feature_importances.columns = ['feature_nm', 'importances']
print(feature_importances)
# visualize the feature importances
n_features = x_train.shape[1]
importances = reg.feature_importances_
column_nm = features
plt.barh(range(n_features), importances, align='center')
plt.yticks(np.arange(n_features), column_nm)
plt.xlabel("feature importances")
plt.ylabel("feature")
plt.ylim(-1, n_features)
plt.show()
6. Random Forest Classifier
6.1 Parameters
class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini',
max_depth=None, min_samples_split=2, min_samples_leaf=1,
min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None,
random_state=None, verbose=0, warm_start=False, class_weight=None,
ccp_alpha=0.0, max_samples=None)
- n_estimators: number of decision trees
- criterion: impurity measure used to decide splits (gini, entropy)
- max_depth: maximum depth of each tree; if None, nodes are expanded until every leaf is pure or contains fewer than min_samples_split samples
- min_samples_split: minimum number of samples required to split an internal node
- min_samples_leaf: minimum number of samples required at a leaf node
- max_leaf_nodes: maximum number of leaf nodes; None means unlimited
- bootstrap: if False, the whole dataset is used to build each tree
- oob_score: whether to use out-of-bag samples to estimate the generalization score
- ccp_alpha: complexity parameter for minimal cost-complexity pruning; the subtree with the largest cost-complexity that is still smaller than ccp_alpha is chosen; with the default of 0.0, no pruning is performed
- min_impurity_decrease: a node is split only if the split decreases the impurity by at least this value (a tuning sketch over several of these parameters follows below)
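These parameters are usually tuned together rather than one by one; a minimal GridSearchCV sketch over a few of them (the grid values are arbitrary):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": [100, 300],
        "max_depth": [None, 5, 10],
        "min_samples_split": [2, 5],
    },
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))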
6.2 Attributes
- feature_importances_: feature importances
- oob_score_
6.3 Methods
- fit(X,y)
- predict(X)
- predict_proba(X)
- score(X,y)
6.4 Implementation
Load the data
import pandas as pd
breast = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/breast-cancer.csv")
import numpy as np
from sklearn.model_selection import train_test_split
breast["diagnosis"] = np.where(breast["diagnosis"]=="M", 1, 0)
features = ["area_mean", "texture_mean"]
X = breast[features]
y = breast["diagnosis"]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=1)
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
(398, 2) (171, 2)
(398,) (171,)
# modeling & evaluation
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, min_samples_split=5)
pred = clf.fit(x_train, y_train).predict(x_test)
print("Accuracy : ", clf.score(x_test, y_test))
Accuracy : 0.9005847953216374
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
pred = clf.predict(x_test)
test_cm = confusion_matrix(y_test, pred)
test_acc = accuracy_score(y_test, pred)
test_prc = precision_score(y_test, pred)
test_rcll = recall_score(y_test, pred)
test_f1 = f1_score(y_test, pred)
print(test_cm)
print('Accuracy\t{}%'.format(round(test_acc * 100, 2)))
print('Precision\t{}%'.format(round(test_prc * 100, 2)))
print('Recall\t{}%'.format(round(test_rcll * 100, 2)))
[[103   4]
 [ 13  51]]
Accuracy 90.06%
Precision 92.73%
Recall 79.69%
# ROC curve and AUC
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
roc_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()
print("ROC_AUC_score : ", roc_auc)
# feature importances
importances = clf.feature_importances_
column_nm = pd.DataFrame(["area_mean", "texture_mean"])
feature_importances = pd.concat([column_nm,
pd.DataFrame(importances)],
axis=1)
feature_importances.columns = ['feature_nm', 'importances']
print(feature_importances)
feature_nm importances
0 area_mean 0.687528
1 texture_mean 0.312472
# visualize the feature importances
f = features
xtick_label_position = list(range(len(f)))
plt.xticks(xtick_label_position, f)
plt.bar([x for x in range(len(importances))], importances)
7. Random Forest Regressor
7.1 Parameters
class sklearn.ensemble.RandomForestRegressor(n_estimators=100, *,
criterion='squared_error', max_depth=None, min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True,
oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False,
ccp_alpha=0.0, max_samples=None)
- Same as the classifier
7.2 Attributes
- Same as the classifier
7.3 Methods
- Same as the classifier
7.4 Implementation
Load the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
car = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/CarPrice_Assignment.csv")
car_num = car.select_dtypes(['number'])
features = list(car_num.columns.difference(['car_ID', 'symboling', 'price']))
X = car_num[features]
y = car_num['price']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(143, 13)
(62, 13)
(143,)
(62,)
# modeling
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor()
pred = reg.fit(x_train, y_train).predict(x_test)
# evaluation
from sklearn.metrics import mean_squared_error, mean_absolute_error
mse = mean_squared_error(y_test, pred)
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mse)
r2 = reg.score(x_test, y_test)  # score() returns R^2 for regressors, not accuracy
print('MSE\t{}'.format(round(mse, 3)))
print('MAE\t{}'.format(round(mae, 3)))
print('RMSE\t{}'.format(round(rmse, 3)))
print('R2\t{}%'.format(round(r2 * 100, 3)))
MSE 4171875.557
MAE 1333.243
RMSE 2042.517
R2 93.09%
# feature importances
importances = reg.feature_importances_
column_nm = pd.DataFrame(features)
feature_importances = pd.concat([column_nm,
pd.DataFrame(importances)],
axis=1)
feature_importances.columns = ['feature_nm', 'importances']
print(feature_importances)
feature_nm importances
0 boreratio 0.005480
1 carheight 0.003741
2 carlength 0.009772
3 carwidth 0.017285
4 citympg 0.005848
5 compressionratio 0.003881
6 curbweight 0.183016
7 enginesize 0.663279
8 highwaympg 0.059726
9 horsepower 0.024374
10 peakrpm 0.006611
11 stroke 0.003564
12 wheelbase 0.013423
# visualize the feature importances
n_features = x_train.shape[1]
importances = reg.feature_importances_
column_nm = features
plt.barh(range(n_features), importances, align='center')
plt.yticks(np.arange(n_features), column_nm)
plt.xlabel("feature importances")
plt.ylabel("feature")
plt.ylim(-1, n_features)
plt.show()