ADP 실기 연습문제 (2)

2023-10-23 6 분 소요

본 게시물은 ADP 실기의 연습문제 풀이에 대한 것이다.

머신러닝

1. 시각화 포함 EDA 수행

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
df1 = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/diabetes_for_test.csv")

1.1.1 데이터 개요

df1

df1.describe()

df1.info()

Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)

📍 데이터 개요 해석

null 값이 없는 데이터 셋
수치형 데이터는 이상치가 있는지 유의해서 봐야 함
outcome은 Int형이지만 categorical 변수임
각 컬럼별 범위의 차이가 있기 때문에 스케일링 고려해야 함

1.1.2 시각화

fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(16, 16))

columns = df1.columns.to_list()

for i, column in enumerate(columns):
    row = i // 3
    col = i % 3
    axes[row, col].hist(df1[column], bins=20, color='skyblue', alpha=0.7)
    axes[row, col].set_title(f'Histogram of {column}')
    axes[row, col].set_xlabel(column)
    axes[row, col].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# 연속형
continuous_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# 범주형
categorical_columns = ['Outcome']

👀 연속형-히스토그램

fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(15, 4))

for i, column in enumerate(continuous_columns):

    row = i // 4
    col = i % 4
    df1[column].plot(kind='hist', bins=20, ax=axes[row, col], color='skyblue', edgecolor='black')
    axes[row, col].set_title(f'Histogram of {column}')
    axes[row, col].set_xlabel(column)
    axes[row, col].set_ylabel('Frequency')

plt.tight_layout()

plt.show()

👀 연속형-박스플롯

fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(15, 4))

for i, column in enumerate(continuous_columns):

    row = i // 4
    col = i % 4
    sns.boxplot(x=df1[column], ax=axes[row, col], color='skyblue')
    axes[row, col].set_title(f'Box plot of {column}')
    axes[row, col].set_xlabel(column)
    axes[row, col].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

👀 범주형-막대 그래프

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(3, 3))

for i, column in enumerate(categorical_columns):

    df1[column].value_counts().sort_index().plot(kind='bar', ax=axes, color='skyblue')
    axes.set_title(f'Bar Plot of {column}')
    axes.set_xlabel(column)
    axes.set_ylabel('Frequency')    

plt.tight_layout()
plt.show()

👀 그룹별 데이터

classification 문제이기 때문에 그룹별로 데이터를 시각화하여 파악

diabetes = df1.groupby('Outcome').mean()
diabetes

Outcome에 따른 그룹별 평균 데이터

fig, axes = plt.subplots(2, 4, figsize=(16, 12))

columns = diabetes.columns.to_list()

for i, column in enumerate(columns):
    row = i // 4
    col = i % 4
    sns.barplot(x=diabetes.index, y=column, data=diabetes, ax=axes[row, col])
    axes[row, col].set_title(f'Average {column}')
    axes[row, col].set_xlabel("Outcome")
    axes[row, col].set_ylabel(f'Average {column}')

plt.tight_layout()
plt.show()

👀 상관관계 시각화

df_cor = df1.drop(columns=["Outcome"]).corr(method='pearson')
sns.heatmap(df_cor,
             xticklabels = df_cor.columns,
             yticklabels = df_cor.columns,
             cmap='RdBu_r',
             annot=True, 
             linewidth=3)

📍 시각화 결과 해석

연속형 변수의 히스토그램을 통해 Insulin과 Age의 분포가 한쪽에 많이 몰려있음을 알 수 있음
연속형 변수의 박스플롯을 통해 Outlier들이 존재함을 확인
범주형 변수의 막대그래프를 통해 클래스가 불균형함을 확인
Outcome이 1일때 대부분의 데이터가 높게 나타나는 경향이 있음
상관관계 분석을 통해 0.9 이상의 높은 값을 가지는 변수가 없음을 확인하였기에 선형 모델 사용시 모든 변수를 사용하여도 무방할 것으로 판단

1.2 이상치를 식별하고 처리

1.2.1 이상치 식별

👀 이상치 시각화-박스플롯

X = df1.drop(columns=['Outcome'])
df_v1 = pd.melt(X ,var_name='col', value_name='value')
df_v1

plt.figure(figsize = (15, 7))
sns.boxplot(x ='col', y ='value', data = df_v1)
plt.xticks(range(8), X.columns)
plt.show()

👀 이상치 비율 확인

out_Glucose = df1[df1.Glucose == 0]
out_Blood = df1[df1.BloodPressure == 0]
out_Age = df1[df1.Age > 100]

sorted_df = df1.sort_values(by='Insulin', ascending=False)
total_rows = len(sorted_df)
top_percent = int(total_rows * 0.005)
out_Insulin = sorted_df.head(top_percent)

df_out = pd.DataFrame({"num": [len(out_Glucose), len(out_Blood), len(out_Age), len(out_Insulin)], "index":["Glucose", "BloodPressure", "Age", "Insulin"]})
df_out['percent'] = df_out.apply(lambda x: round(x.num/len(df1), 3)*100, axis=1)
df_out.set_index("index")

📍 이상치 식별 결과 해석

AGE를 제외한 나머지 데이터들은 연속적인 값을 가지기에 정확한 이상치를 파악하기 위해서는 현업의 의견이 필요하나 개인적인 판단으로 이상치 정의
- Glucose, BloodPressure, BMI 등 0이 될 수 없는 값은 이상치로 판단
- Insuline은 상위 0.5%인 경우 이상치로 판단
- AGE는 100이 이상인 경우 이상치로 판단
이상치 식별 결과 BloodPressure를 제외한 값들은 매우 작은 비율이기에 제거
BloodPressure는 Null로 대체 후 KNN을 사용하여 이상치 처리

1.2.2 이상치 처리

👀 인덱스를 통한 이상치 제거 및 대체

out_index = set(out_Glucose.index.tolist()).union(set(out_Age.index.tolist()))
out_index = list(out_index.union(set(out_Insulin.index.tolist())))
df1 = df1.drop(out_index)

이상치 제거

out_blood_index = list(set(out_Blood.index.tolist()).difference(set(out_index)))
df1.loc[out_blood_index, 'BloodPressure'] = None
df1.reset_index(drop=True, inplace=True)

이상치 대체(=결측치 생성)

👀 KNN을 이용한 결측치 대체

from sklearn.impute import KNNImputer

imputer = KNNImputer()
df_filled = imputer.fit_transform(df1)
df_filled = pd.DataFrame(df_filled, columns=df1.columns)
df1[df1.columns] = df_filled
df1.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0

👀 이상치 처리 시각화-박스플롯

X = df1.drop(columns=['Outcome'])
df_v1 = pd.melt(X ,var_name='col', value_name='value')
df_v1

plt.figure(figsize = (15, 7))
sns.boxplot(x ='col', y ='value', data = df_v1)
plt.xticks(range(8), X.columns)
plt.show()

2. 클래스 불균형 처리

👀 클래스 불균형 식별

df1['Outcome'].value_counts()

0.0    492
1.0    265
Name: Outcome, dtype: int64

☀️ OverSampling

from imblearn.over_sampling import RandomOverSampler 

X = df1.drop(['Outcome'],axis=1)
y = df1[['Outcome']]
ros = RandomOverSampler()
X_upsampling,y_upsampling = ros.fit_resample(X,y)
print('기존의 타깃 분포')
print(df1['Outcome'].value_counts()/len(df1))
print('-'*10)
print('upsampling의 타깃 분포')
print(y_upsampling['Outcome'].value_counts()/len(y_upsampling))

기존의 타깃 분포
0.0    0.649934
1.0    0.350066
Name: Outcome, dtype: float64
----------
upsampling의 타깃 분포
1.0    0.5
0.0    0.5
Name: Outcome, dtype: float64

☀️ UnderSampling

from imblearn.under_sampling import RandomUnderSampler
rus = RandomOverSampler()
X_undersampling,y_undersampling = rus.fit_resample(X,y)
print('기존의 타깃 분포')
print(df1['Outcome'].value_counts()/len(df1))
print('-'*10)
print('undersampling의 타깃 분포')
print(y_undersampling['Outcome'].value_counts()/len(y_undersampling))

기존의 타깃 분포
0.0    0.649934
1.0    0.350066
Name: Outcome, dtype: float64
----------
undersampling의 타깃 분포
1.0    0.5
0.0    0.5
Name: Outcome, dtype: float64

3. 모델링

3.1 다양한 모델링 결과 도출

from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
import sklearn.svm as svm

log = LogisticRegression()
xgb = XGBClassifier(random_state=0)
svm_clf =svm.SVC(kernel ='linear')

from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
import time
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=0)
## 5개의 경우의 수로 분할하여 검증 
kfold = KFold()

def model_result(model):
    pred_li =[]
    
    for train_index,test_index in kfold.split(X):
        X_train,X_test = X.iloc[train_index,:],X.iloc[test_index,:]
        y_train,y_test = y.iloc[train_index,:],y.iloc[test_index,:]

        X_train_resample,y_train_resample = smote.fit_resample(X_train,y_train)

        start = time.time()
        model.fit(X_train_resample,y_train_resample)
        end = time.time()

        pred = model.predict(X_test)
        pred_li.append(accuracy_score(pred,y_test['Outcome']))

    ## 마지막 데이터 학습 속도 
    print(f"{end - start:.5f} sec")
    ## 5개의 train데이터에 대한 정확도의 평균 값 
    print(np.mean(pred_li))

model_result(log)

0.01135 sec
0.7306117113976995

model_result(xgb)

0.18677 sec
0.7450766817706518

model_result(svm_clf)

1.37559 sec
0.7557162774485884

3.2 차원축소를 사용한 모델링

☀️ PCA를 통한 차원 축소

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_train = train_test_split(X,y, stratify=y, test_size=0.3, random_state=2022)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
pca = PCA(n_components=8)
X_train_pca = pca.fit(X_train_s)
print(pca.explained_variance_ratio_)

for i in range(len(pca.explained_variance_ratio_)):
    print(i, pca.explained_variance_ratio_[:i].sum())

[0.25906554 0.22771341 0.12534153 0.11213758 0.10048032 0.07434416
05278052 0.04813696]
0.0
0.2590655358872176
0.4867789432683496
0.6121204775600299
0.7242580556898799
0.824738371243592
0.8990825273257718
0.9518630446620266

주성분 분석을 통해 5개의 주성분 사용시 총 데이터의 82%를 표현할 수 있음을 확인

def pca_model_result(model):
    pred_li =[]
    smote = SMOTE(random_state=0)
    
    for train_index,test_index in kfold.split(X):
        X_train,X_test = X.iloc[train_index,:],X.iloc[test_index,:]
        y_train,y_test = y.iloc[train_index,:],y.iloc[test_index,:]

        X_train_resample,y_train_resample = smote.fit_resample(X_train,y_train)

        scaler = StandardScaler()
        X_train_res_s = scaler.fit_transform(X_train_resample)
        X_test_s = scaler.transform(X_test)

        pca = PCA(n_components=5)
        X_train_pca = pca.fit_transform(X_train_res_s)
        X_test_pca = pca.transform(X_test_s)

        start = time.time()
        model.fit(X_train_pca,y_train_resample)
        end = time.time()

        pred = model.predict(X_test_pca)
        pred_li.append(accuracy_score(pred,y_test['Outcome']))


    print(f"{end - start:.5f} sec")
    print(np.mean(pred_li))

pca_model_result(log)

0.00185 sec
0.7319100731962357

pca_model_result(xgb)

0.20892 sec
0.7107877309166957

pca_model_result(svm_clf)

0.01554 sec
0.7279365632624608

예측 성능은 떨어졌지만 수행 속도가 빨라짐을 확인

통계분석

1. 연관성 분석

상품 a와 b가 있을 때 다음과 같은 구매 패턴이 있다고 한다.

[‘a’,‘a’,‘b’,‘b’,‘a’,‘a’,‘a’,‘a’,‘b’,‘b’,‘b’,‘b’,‘b’,‘a’,‘a’,‘b’,‘b’,‘a’,‘b’,‘b’]

1.1 구매 패턴으로 볼 때 두 상품이 연관이 있는지 가설을 세우고 검정하시오.

✏️ 문제 정리

이진항목에 대한 연관성을 검정하는 것으로 one sample run test를 수행함

📍 가설 수립

H0: 연속적인 관측이 연관성이 없다.
H1: 연속적인 관측이 연관성이 있다.

import pandas as pd 
data = ['a','a','b','b','a','a','a','a','b','b','b','b','b','a','a','b','b','a','b','b']
test_df = pd.DataFrame(data,columns=['product'])
test_df.loc[test_df['product']=='a','product'] =1
test_df.loc[test_df['product']=='b','product'] =0
test_df['product']

from statsmodels.sandbox.stats.runs import runstest_1samp
runstest_1samp(test_df['product'])

(-1.1144881152070183, 0.26506984027306035)

📍 one sample run test 해석

통계량: -1.1144881152070183
p-value: 0.26506984027306035
p-value가 0.05보다 크므로 귀무가설 채택
구매 이력에 연관성이 없음

2. 코딩 테스트

제품 1, 2를 만드는 데 재료 a, b, c가 일부 사용되며, 제품 1과 2를 만들 때 12만원과 18만원을 벌 수 있다. 재료는 한정적으로 주어지는데, 이때 최대 수익을 낼 수 있을 때의 제품 1과 제품 2의 개수를 구하라.

재료 공급량 ☞ {a:1300, b:1000, c:1200}

data = pd.DataFrame({"a": [20, 40], "b":[20, 30], "c":[20,30]}, index=[1,2])
data

✏️ 문제 정의

선형 계획법으로 풀이
제품 1: 12만원 / 제품 2: 18만원

from pulp import LpProblem, LpVariable, LpMaximize

# 문제 정의 (최대화 문제)
prob = LpProblem("MaximizeProfit", LpMaximize)

# 변수 정의 (제품 1과 제품 2의 개수, 0 이상의 정수)
x1 = LpVariable("Product1", 0, None, 'Integer')
x2 = LpVariable("Product2", 0, None, 'Integer')

# 목적함수 추가 (제품 1과 제품 2로부터의 수익 최대화)
prob += 120000 * x1 + 180000 * x2

# 제약조건 추가 (재료 a, b, c의 공급량 한정)
prob += 20 * x1 + 40 * x2 <= 1300, "Material A Constraint"
prob += 20 * x1 + 30 * x2 <= 1000, "Material B Constraint"
prob += 20 * x1 + 30 * x2 <= 1200, "Material C Constraint"

# 문제 풀이
prob.solve()

# 결과 출력
print("Product1: ", int(x1.varValue))
print("Product2: ", int(x2.varValue))
print("Max Profit: ", int(prob.objective.value()))

Product1:  5
Product2:  30
Max Profit:  6000000

Twitter Facebook LinkedIn

sigi