ADP 실기 6장 Logistic Regression

2023-08-13 2 분 소요

본 게시물은 Logistic Regression에 대해 소개한다.

1. Simple Logistic Regression

1.1 Concept

샘플이 특정 클래스에 속할 확률을 추정하는 분류문제와 같이 target이 범주형일 경우 적용하는 회귀분석을 Logistic Regerssion이라함

Odds: 실패에 비해 성공할 확률, 성공 확률이 P일 경우 p/(p-1)로 구할 수 있음
Logit Transform: log(odds)를 나타내며, Y축이 확률인 Logistic Regression을 logit 변환을 통해 실수로 치환함
Maximum Likelihood: Logistic Regression의 best fitted line을 찾기 위해 사용, 즉 비용함수 개념

1.2 Parameters

class sklearn.linear_model.LogisticRegression(penalty=‘l2’, *, dual=False, tol=0.0001, 
C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, 
solver=‘lbfgs’, max_iter=100, multi_class=‘auto’, 
verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

class_weight: 클래스와 관련된 가중치, 지정하지 않으면 모두 1
solver: 최적화 문제를 푸는 해를 구할 때 사용할 알고리즘을 선택, 데이터 셋의 크기가 작으면 liblinear가 좋다고 알려짐, 다중 클래스 분류일 경우 newton-cg, sag, saga, lbfgs

1.3 Attributes

classes_: 라벨링된 클래스
coef_: feature에 할당된 가중치

1.4 Methods

decision_function: 샘플데이터에 대해 예측한 값이 Hyperplane으로 부터 얼마나 떨어져 있는지 (logistic에서 hyper plane은 0)
predict(X): 예측값을 array로 반환
predict_log_proba(X): 클래스에 대한 샘플의 로그 확률
predict_proba(X): 클래스에 대한 샘플의 확률

1.4 Implementation

😗 데이터 불러오기

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
body=pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/bodyPerformance.csv")

📍성별 분류기 만들기

# 성별을 0/1 으로 변환
body['gender']=np.where(body['gender']=='M', 0, 1)
body['class_1']=np.where(body['class']=='A', 1, 0)
body

# 데이터 셋 분류
from sklearn.model_selection import train_test_split
feature_columns = list(body.columns.difference(['class', 'class_1']))
x=body[feature_columns]
y=body['class_1']
train_x, test_x, train_y, test_y=train_test_split(x,y,stratify=y, 
train_size=0.7, random_state=1)
print(train_x.shape, test_x.shape, train_y.shape, test_y.shape)

(9375, 11) (4018, 11) (9375,) (4018,)

# 모델 학습
from sklearn.linear_model import LogisticRegression
logR=LogisticRegression(random_state=45)
logR.fit(train_x, train_y)

# 결과 시각화
proba=pd.DataFrame(logR.predict_proba(train_x))
cs=logR.decision_function(train_x)
df=pd.concat([proba, pd.DataFrame(cs)], axis=1)
df.columns=['Not A','A', 'decision_function']
df.sort_values(['decision_function'], inplace=True)
df.reset_index(inplace=True, drop=True)
df

import matplotlib.pyplot as plt

plt.figure(figsize=(15,5))
plt.axhline(y=0.5, linestyle='--', color='black', linewidth=1)
plt.axvline(x=0, linestyle='--', color='black', linewidth=1)
plt.plot(df['decision_function'], df['Not A'], 'g--', label='Not A')
plt.plot(df['decision_function'], df['Not A'], 'g^')
plt.plot(df['decision_function'], df['A'], 'b--', label='A')
plt.plot(df['decision_function'], df['A'], 'b*')
plt.xlabel
plt.ylabel
plt.legend(loc='upper left')
plt.show()

# 평가
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
pred=logR.predict(test_x)
test_cm=confusion_matrix(test_y, pred)
test_acc=accuracy_score(test_y, pred)
test_prc=precision_score(test_y, pred)
test_rcll=recall_score(test_y, pred)
test_f1=f1_score(test_y, pred)
print(test_cm)
print('\n')
print('정확도\t{}%'.format(round(test_acc*100,2)))
print('정밀도\t{}%'.format(round(test_prc*100,2)))
print('재현율\t{}%'.format(round(test_rcll*100,2)))
print('F1\t{}%'.format(round(test_f1*100,2)))

[[2763  251]
 [ 342  662]]

정확도	85.24%
정밀도	72.51%
재현율	65.94%
F1	69.07%

# ROC curve 시각화
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(logR, test_x, test_y)
plt.show()

2. Multi Class Softmax Regression

2.1 Concept

로지스틱 회귀를 2개 이상 클래스인 다중 클래스에 대해 일반화 한 것

softmax: K 차원의 벡터를 입력으로 받아 각 차원(클래스)에 속할 확률을 계산함
logistic의 multi_class를 multinomail로, solver에는 알고리즘을 적용

2.2 Implementation

😗 데이터 불러오기

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings(‘ignore’)
body = pd.read_csv(“https://raw.githubusercontent.com/ADPclass/ADP_
book_ver01/main/data/bodyPerformance.csv”)

📍 Class 분류기

# gender 변수 전처리
body['gender']=np.where(body['gender']=='M', 0, 1)
# class 변수 전처리
mapping={'A':0, 'B':1, 'C':2, 'D':4}
body['class_2']=body['class'].map(mapping)
body

# 데이터 분리
from sklearn.model_selection import train_test_split
feature_columns = list(body.columns.difference(['class', 'class_2']))
x=body[feature_columns]
y=body['class_2']
train_x, test_x, train_y, test_y = train_test_split(x,y,stratify=y, 
train_size=0.7, random_state=1)
print(train_x.shape, test_x.shape, train_y.shape, test_y.shape)

(9375, 11) (4018, 11) (9375,) (4018,)

# 모델 학습
from sklearn.linear_model import LogisticRegression
softm=LogisticRegression(multi_class='multinomial', solver='lbfgs', C=10, 
random_state=45)
softm.fit(train_x, train_y)

# 평가
from sklearn.metrics import confusion_matrix, accuracy_score
pred=softm.predict(test_x)
test_cm=confusion_matrix(test_y, pred)
test_acc=accuracy_score(test_y, pred)
print(test_cm)
print('\n')
print('정확도\t{}%'.format(round(test_acc*100,2)))

[[707 261  36   0]
 [269 403 300  32]
 [ 92 207 525 181]
 [ 13  63 157 772]]

정확도	59.91%

print("샘플의 class 예측")
print(softm.predict([test_x.iloc[-1,:]]))
print("각 class에 속할 확률")
print(softm.predict_proba([test_x.iloc[-1,:]]))

샘플의 class 예측
[0]

각 class에 속할 확률
[[0.62639722 0.311902   0.06015632 0.00154446]]

Twitter Facebook LinkedIn

sigi

ADP 실기 6장 Logistic Regression

1. Simple Logistic Regression

1.1 Concept

1.2 Parameters

1.3 Attributes

1.4 Methods

1.4 Implementation

2. Multi Class Softmax Regression

2.1 Concept

2.2 Implementation

공유하기

댓글남기기

참고

Llama 3.1 한국어 Finetuning

LLM의 활용 방법

k8s (3) Pod 정리

Continuous distribution (3) Gamma Distribution