ADP 실기 연습문제 (1)

2023-10-21 7 분 소요

본 게시물은 ADP 실기의 연습문제 풀이에 대한 것이다.

머신러닝

1. 시각화 포함 EDA 수행

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("https://raw.githubusercontent.com/sigirace/page-images/main/adp/problem/practice1/data/student_data.csv")

1.1 데이터 개요

df

df.describe()

df.info()

RangeIndex: 395 entries, 0 to 394
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 school      395 non-null    object 
 sex         395 non-null    object 
 paid        395 non-null    object 
 activities  395 non-null    object 
 famrel      394 non-null    float64
 freetime    393 non-null    float64
 goout       392 non-null    float64
 Dalc        391 non-null    float64
 Walc        393 non-null    float64
 health      391 non-null    float64
absences    392 non-null    float64
grade       395 non-null    int64  
G1          392 non-null    float64
G2          392 non-null    float64
dtypes: float64(9), int64(1), object(4)

Null 값이 있음을 확인

1.2 시각화

fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(16, 16))

columns = df.columns.to_list()

for i, column in enumerate(columns):
    row = i // 4
    col = i % 4
    axes[row, col].hist(df[column], bins=20, color='skyblue', alpha=0.7)
    axes[row, col].set_title(f'Histogram of {column}')
    axes[row, col].set_xlabel(column)
    axes[row, col].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# 연속형
continuous_columns = ['absences', 'grade', 'G1', 'G2']

# 범주형
categorical_columns = ['school', 'sex', 'paid', 'activities', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health']

📍 연속형-히스토그램

fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(15, 4))

for i, column in enumerate(continuous_columns):
    df[column].plot(kind='hist', bins=20, ax=axes[i], color='skyblue', edgecolor='black')
    axes[i].set_title(f'Histogram of {column}')
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('Frequency')

plt.tight_layout()

plt.show()

📍 연속형-박스플롯

fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(15, 4))

for i, column in enumerate(continuous_columns):
    sns.boxplot(x=df[column], ax=axes[i], color='skyblue')
    axes[i].set_title(f'Histogram of {column}')
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('Frequency')

plt.tight_layout()

plt.show()

📍 범주형-막대그래프

fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(15, 5))

for i, column in enumerate(categorical_columns):
    row = i // 5
    col = i % 5
    df[column].value_counts().sort_index().plot(kind='bar', ax=axes[row, col], color='skyblue')
    axes[row, col].set_title(f'Bar Plot of {column}')
    axes[row, col].set_xlabel(column)
    axes[row, col].set_ylabel('Frequency')    

plt.tight_layout()
plt.show()

📍상관관계

df_cor = df.corr(method='pearson')
sns.heatmap(df_cor,
             xticklabels = df_cor.columns,
             yticklabels = df_cor.columns,
             cmap='RdBu_r',
             annot=True, 
             linewidth=3)

1.3 데이터 탐색 결과

결측치
- famrel, freetime, goout, Dalc, Walc, health, absences, G1, G2에 결측치가 존재함을 확인
이상치
- absences 데이터는 이상치가 있을 가능성이 있음
불균형
- school 변수에 불균형이 있음을 확인
상관관계
- G1과 G2간의 상관관계가 높기에 선형 모델 사용시 다중공선성 문제를 확인해야함

2. 결측치를 식별·예측하는 두 가지 방법을 쓰고, 이를 선택한 이유를 설명하시오.

2.1 결측치 확인

df_isna = pd.DataFrame(df.isna().sum(), columns=['cnt'])
df_isna

📍 결측치 데이터

null_df = df[df.isna().any(axis=1)]
null_df

모든 행이 결측치인 경우는 없으므로 제거보다는 대체가 효과적

📍 결측치 비율

null_percentage_by_row = df.isnull().mean() * 100
null_percentage_df = pd.DataFrame({'percent': null_percentage_by_row})
null_percentage_df[null_percentage_df.percent > 0]

null_percentage = (len(null_df) / len(df))*100
print("전체 데이터의 결측치 비율: {}".format(null_percentage))

전체 데이터의 결측치 비율: 6.075949367088607

2.2 결측치 식별, 예측 방법

📍단순 대체

수치형: 평균이나 중앙값으로 대체
범주형: 최빈값을 사용하여 대체

cate_null_col = ['famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health']
num_null_col = ['absences', 'G1', 'G2']

for col in cate_null_col:
    mod = df[col].mode().iloc[0]
    print("{}의 결측치는 {}로 대체함".format(col, mod))
    df[col].fillna(mod, inplace=True)

famrel의 결측치는 4.0로 대체함
freetime의 결측치는 3.0로 대체함
goout의 결측치는 3.0로 대체함
Dalc의 결측치는 1.0로 대체함
Walc의 결측치는 1.0로 대체함
health의 결측치는 5.0로 대체함

📍KNN을 사용한 대체

from sklearn.impute import KNNImputer

# 결측치가 있는 수치형 데이터만을 추출
KNN_data = df.drop(columns=['school','sex','paid','activities'])
#모델링
imputer = KNNImputer()
df_filled = imputer.fit_transform(KNN_data)
df_filled = pd.DataFrame(df_filled, columns=KNN_data.columns)
df[KNN_data.columns] = df_filled

df.isna().sum()

school        0
sex           0
paid          0
activities    0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
grade         0
G1            0
G2            0

3. 범주형 변수 인코딩이 필요한 경우를 식별하고, 변환을 적용하시오. 이를 선택한 이유를 설명하시오.

📍 One-hot Encoding

범주형 변수 중 school, sex, paid, activites는 순서가 없는 명목형이기에 Onehot 사용

df = pd.get_dummies(data = df, columns=['school', 'sex', 'paid', 'activities'], drop_first=True)
df.info()

Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 famrel          395 non-null    float64
 freetime        395 non-null    float64
 goout           395 non-null    float64
 Dalc            395 non-null    float64
 Walc            395 non-null    float64
 health          395 non-null    float64
 absences        395 non-null    float64
 grade           395 non-null    float64
 G1              395 non-null    float64
 G2              395 non-null    float64
school_MS       395 non-null    uint8  
sex_M           395 non-null    uint8  
paid_yes        395 non-null    uint8  
activities_yes  395 non-null    uint8  

4. 데이터 분할 방법을 2가지 쓰고 적절한 데이터 분할을 적용하시오. 이를 선택한 이유를 설명하시오.

4.1 데이터 분할 방법

랜덤 분할
- 데이터를 학습 데이터와 검증 데이터로 나눌 때, 사용자가 지정한 비율에 따라 무작위로 분할합니다.
- 주로 모델의 일반화 성능을 평가하기 위해 사용됩니다.
층화 추출
- 종속 변수가 범주형 변수인 경우에 사용합니다.
- 각 클래스의 비율을 고려하여 데이터를 분할하여 클래스 편향을 방지합니다.

☞ 종속변수가 수치형인 grade이기에 랜덤 분할을 사용함

4.2 데이터 분할 수행

from sklearn.model_selection import train_test_split 
X = df.drop('grade',axis=1)
y = df['grade']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=2022)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(276, 13)
(119, 13)
(276,)
(119,)

5. 여러 모델을 사용하여 모델링을 수행하고 최적의 결과 도출

from sklearn.svm import SVR 
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import time

scaler = StandardScaler() 
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

5.1 SVR

param_grid = [{'C': [0.1, 1,10,100],'gamma': [0.001, 0.01, 0.1, 1, 10]}]
grid_svm = GridSearchCV(SVR(), param_grid =param_grid, cv =5)
grid_svm.fit(X_train_scaled, y_train)
result = pd.DataFrame(grid_svm.cv_results_['params'])
result['mean_test_score'] = grid_svm.cv_results_['mean_test_score']
result.sort_values(by='mean_test_score', ascending=False)

start_time = time.time()
svr = SVR(C=100, gamma =0.001) 
svr.fit(X_train_scaled, y_train)
end_time = time.time()
execution_time = end_time - start_time
print("코드 실행 시간: {:.2f} 초".format(execution_time))
print("R2 : ", svr.score(X_test_scaled, y_test))
print("RMSE:", np.sqrt(mean_squared_error(y_test,svr.predict(X_test_scaled))))

코드 실행 시간: 0.01 초
R2 :  0.9575168742687914
RMSE: 0.7743847717785174

5.2 Random Forest Regressor

rf_grid = [
 {'max_depth': [2,4,6,8,10], 'min_samples_split': [2, 4, 6, 8, 10]}
]
rf = GridSearchCV(RandomForestRegressor(n_estimators=100), param_grid =rf_grid, cv =5)
rf.fit(X_train, y_train)
print(rf.best_params_)

{'max_depth': 8, 'min_samples_split': 4}

start_time = time.time()
rf=RandomForestRegressor(**rf.best_params_)
rf.fit(X_train, y_train)
end_time = time.time()
execution_time = end_time - start_time
print("코드 실행 시간: {:.2f} 초".format(execution_time))
print("R2 : ", rf.score(X_test, y_test))
print("RMSE: ", np.sqrt(mean_squared_error(y_test,rf.predict(X_test))))

코드 실행 시간: 0.10 초
R2 :  0.9594471856156329
RMSE:  0.756587344467249

5.3 Xgboost Regerssor

xgb_grid = [
 {'max_depth': [2,4,6,8,10]}
]
xgb = GridSearchCV(XGBRegressor(n_estimators=1000), param_grid =xgb_grid, cv =5)
xgb.fit(X_train, y_train)
print(xgb.best_params_)

{'max_depth': 4}

start_time = time.time()
xgb=XGBRegressor(**xgb.best_params_)
xgb.fit(X_train, y_train)
end_time = time.time()
execution_time = end_time - start_time
print("코드 실행 시간: {:.2f} 초".format(execution_time))
print("R2 : ", xgb.score(X_test, y_test))
print("RMSE: ", np.sqrt(mean_squared_error(y_test,xgb.predict(X_test))))

코드 실행 시간: 0.11 초
R2 :  0.9516113211392506
RMSE:  0.8264573665053062

from xgboost import plot_importance
plot_importance(xgb)

통계분석

1. 회귀분석

1.1 데이터를 8 : 2로 분할하고 선형 회귀를 적용하시오. 결정계수와 rmse를 구하시오.

import pandas as pd 
import numpy as np 
import mglearn

# 데이터 불러오기
X,y = mglearn.datasets.load_extended_boston()

# 훈련, 테스트 세트 분리
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression() 
lr.fit(X_train,y_train)

## R2
print("선형 회귀 결정계수 : ", lr.score(X_test,y_test))
print("선형 회귀 RMSE : ", np.sqrt(mean_squared_error(y_test, lr.predict(X_test))))

선형 회귀 결정계수 :  0.6158858584078888
선형 회귀 RMSE :  5.59265723707854

1.2 데이터를 8 : 2로 분할하고 릿지 회귀를 적용하시오. alpha 값을 0부터 1까지 0.1단위로 모두 탐색해서 결정계수가 가장 높을 때의 알파를 찾고, 해당 알파로 다시 모델을 학습해서 결정계수와 rmse를 계산하시오.

from sklearn.linear_model import Ridge 
from sklearn.model_selection import GridSearchCV

alpha = np.arange(0,1.1,0.1)
ridge = Ridge() 
param_grid = {'alpha':alpha}
ridge_model = GridSearchCV(ridge, param_grid)
ridge_model.fit(X_train,y_train)
print(ridge_model.best_params_)
print("릿지 회귀 결정계수 : ", ridge_model.score(X_test,y_test))
print("릿지 회귀 RMSE : ", np.sqrt(mean_squared_error(y_test, ridge_model.predict(X_test))))

{'alpha': 0.1}
릿지 회귀 결정계수 :  0.7463824108919223
릿지 회귀 RMSE :  4.544412437236844

1.3 데이터를 8 : 2로 분할하고 라쏘 회귀를 적용하시오. alpha 값을 0부터 1까지 0.1단위로 모두 탐색해서 결정계수가 가장 높을 때의 알파를 찾고, 해당 알파로 다시 모델을 학습해서 결정계수와 rmse를 계산하시오.

from sklearn.linear_model import Lasso

lasso = Lasso() 
param_grid = {'alpha':alpha}
lasso_model = GridSearchCV(lasso, param_grid)
lasso_model.fit(X_train,y_train)
print(lasso_model.best_params_)
print("라쏘 회귀 결정계수 : ", lasso_model.best_estimator_.score(X_test,y_test))
print("라쏘 회귀 RMSE : ", np.sqrt(mean_squared_error(y_test, lasso_model.best_estimator_.predict(X_test))))

{'alpha': 0.0}
라쏘 회귀 결정계수 :  0.6901880385279766
라쏘 회귀 RMSE :  5.022698918447346

2. 다항분석 시각화

2.1 단순 선형 회귀를 다항 회귀로 3차까지 적용시켜 계수를 구하고 3차항을 적용한 모델의 스캐터 플롯과 기울기 선을 그리시오.

import pandas as pd
import numpy as np
m =100
X =6 * np.random.rand(m,1) -3 # [-3,3]의 랜덤한 값
y =3 * X**3 + X**2 +2*X +2 + np.random.randn(m,1) #노이즈 포함
line = np.linspace(-3,3,100, endpoint=False).reshape(-1,1)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures

## x**3 까지 3차항을 적용시켜야 함
poly = PolynomialFeatures(degree=3, include_bias=False)
poly.fit(X)
X_poly = poly.transform(X)
line_poly = poly.transfㅠorm(line)
reg = LinearRegression().fit(X_poly, y)

plt.plot(line, reg.predict(line_poly), c='r',linewidth=3) # line은 -3부터 3까지의 x축
plt.plot(X,y,'o',c ='g', alpha=0.5)

3. ANOVA 분석

3.1 변수 3개(하나는 연속형 변수/나머지 두 개는 범주형 연속변수)의 이원분산분석을 수행하고 통계표를 작성하시오.

import pandas as pd
import numpy as np
avocado = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/avocado.csv")
avocado = avocado[["AveragePrice","type","region"]]
avocado = avocado[(avocado['region']=='Orlando') |
                    (avocado['region']=='Boston' )|
                    (avocado['region']=='Chicago')].reset_index(drop=True)

avocado

3.2 가설수립

📍 상호작용효과 검정

H0: region과 avocado type 간에는 상호작용 효과가 없다.
H1: region과 avocado type 간에는 상호작용 효과가 있다.

📍 주효과 검정

H0: region 종류에 따른 AveragePrice 차이는 존재하지 않는다.
H1: region 종류에 따른 AveragePrice 차이는 존재한다.
H0: type 종류에 따른 AveragePrice 차이는 존재하지 않는다.
H1: type 종류에 따른 AveragePrice 차이는 존재한다.

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# 이원 분산 분석 수행
model = ols('AveragePrice ~ C(region) + C(type) + C(region):C(type)', data=avocado).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print("이원 분산 분석 결과:")
print(anova_table)

이원 분산 분석 결과:
                      sum_sq      df           F         PR(>F)
C(region)           0.432136     2.0    3.189242   4.161918e-02
C(type)            56.111007     1.0  828.218296  1.989417e-133
C(region):C(type)   1.878817     2.0   13.866003   1.146622e-06
Residual           68.291047  1008.0         NaN            NaN

C(region):C(type): region과 type의 상호작용 효과에 대한 검정결과 p-valuerk 0.05보다 즉으므로 귀무가설을 기각하지 않는다.
- 교호작용이 존재함
교호작용이 존재하기에 주효과 검정은 의미가 없음

from statsmodels.graphics.factorplots import interaction_plot
import matplotlib.pyplot as plt

AveragePrice = avocado["AveragePrice"]
avocado_type = avocado["type"]
region = avocado["region"]

fig, ax = plt.subplots(figsize=(6, 6))
fig = interaction_plot(avocado_type, region, AveragePrice,
                       colors=['red', 'blue', 'black'], 
                       markers=['D', '^', 'o'], ms=10, ax=ax)

fig, ax = plt.subplots(figsize=(6, 6))
fig = interaction_plot(region ,avocado_type, AveragePrice,
                       colors=['red', 'blue'], 
                       markers=['D', '^'], ms=10, ax=ax)

Twitter Facebook LinkedIn

sigi