ADP 실기 7장 Support Vector Machine

2023-08-15 3 분 소요

본 게시물은 Support Vector Machine에 대해 소개한다.

1. Support Vector Machine

1.1 Concpet

SVM은 새로운 데이터가 입력되었을 때 기존 데이터를 활용해 분류하는 방법으로, SVC를 확장한 알고리즘

1.2 최대 마진 분류기

초평면: p차원 공간에서 차원이 (p-1)인 평평한 affin 부분 공간
margin: 관측치부터 초평면 사이의 가장 짧은 거리
maximal margin: 관측치로부터 마진이 가장 큰 평면을 찾음
support vector: 초평면에서 가장 가까운 관측치

1.3 서포트 벡터 분류기

일부 오류를 수용하는 최대 마진을 가진 초평면 분류기

1.4 서포트 벡터 머신

SVC는 데이터가 두 클래스로 나뉘고, 경계가 선형일 경우
SVM은 SVC의 개념을 확장하여 커널 함수를 사용해 고차원에서 선형 분류를 하는 기법

1.5 서포트 벡터 회귀

SVM은 일정한 마진 오류 안에서 두 클래스 간의 도로 폭이 최대가 되도록 학습
SVR은 제한된 마진 오류 안에서 도로 안에 가능한 많은 데이터 샘플이 속하도록 학습

2. Support Vector Classifier

2.1 Parameters

class sklearn.svm.SVC(*, C=1.0, kernel=‘rbf’, degree=3, gamma=‘scale’, coef0=0.0, 
shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, 
verbose=False, max_iter=- 1, decision_function_shape=‘ovr’, break_ties=False, 
random_state=None)

C: 오분류 허용 계수 (C>0)
kernel: linear, poly, rbf, sigmoid, precomputed
degree: 커널 함수를 poly로 하였을 때 설정함
gamma: 커널 함수를 poly, rbf, sigmoid로 설정하였을 때, scaling 계수
class_weight: 각 클래스의 가중치로 None일때 모든 가중치는 1

2.2 Attributes

class_weight_: 각 클래스의 가중치
coef_: 커널을 linear로 하였을 때 계수
support_vectors_: 서포트 벡터

2.3 Methods

decision_function(X): 데이터 샘플의 confidence score를 반환
predict(X), predict_proba(X), predict_log_proba(X)

2.4 Implementation

😗 데이터 불러오기

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
c=pd.read_csv(“https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/classification.csv”)
c

# 데이터 분포 확인
sns.pairplot(hue='success', data=c)

# 데이터 분리
from sklearn.model_selection import train_test_split
x=c[['age', 'interest']]
y=c['success']
train_x, test_x, train_y, test_y = train_test_split(x,y,stratify=y, 
train_size=0.7, random_state=1)
print(train_x.shape, test_x.shape, train_y.shape, test_y.shape)

(207, 2) (90, 2) (207,) (90,)

# 스케일링 적용 및 데이터 분포 확인
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
train_x=scaler.fit_transform(train_x)
sns.pairplot(data=pd.concat([pd.DataFrame(train_x),
 train_y.reset_index(drop=True)],
 axis=1),
 hue='success')

SVM은 이상치에 민감하기에 Standard Scaler를 적용
Scaler 적용 후 데이터의 분포를 명확히 확인 가능

# 분류기 모델 생성 및 학습
from sklearn.svm import SVC
clf = SVC(C=0.5, random_state=45)
clf.fit(train_x, train_y)

# 평가
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
test_x_scal = scaler.transform(test_x)
pred=clf.predict(test_x_scal)
test_cm=confusion_matrix(test_y, pred)
test_acc=accuracy_score(test_y, pred)
test_prc=precision_score(test_y, pred)
test_rcll=recall_score(test_y, pred)
test_f1=f1_score(test_y, pred)
print(test_cm)
print('정확도\t{}%'.format(round(test_acc*100,2)))
print('정밀도\t{}%'.format(round(test_prc*100,2)))
print('재현율\t{}%'.format(round(test_rcll*100,2)))
print('F1\t{}%'.format(round(test_f1*100,2)))

[[37  2]
 [10 41]]
정확도	86.67%
정밀도	95.35%
재현율	80.39%
F1	87.23%

📍 오분류 계수에 따른 Margin의 변화

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import LinearSVC
plt.figure(figsize=(10, 5))
for i, C in enumerate([1, 500]):
 clf = LinearSVC(C=C, loss="hinge", random_state=42).fit(train_x, 
train_y)
 # decision function으로 서포트 벡터 얻기
 decision_function = clf.decision_function(train_x)
 support_vector_indices = np.where(np.abs(decision_function) <= 1 +
1e-15)[0]
 support_vectors = train_x[support_vector_indices]
 plt.subplot(1, 2, i + 1)
 plt.scatter(train_x[:, 0], train_x[:, 1], c=train_y, s=30, cmap
=plt.cm.Paired)
 ax = plt.gca()
 xlim = ax.get_xlim()
 ylim = ax.get_ylim()
 xx, yy = np.meshgrid(
 np.linspace(xlim[0], xlim[1], 50), np.linspace(ylim[0], ylim[1], 50)
 )
 Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
 Z = Z.reshape(xx.shape)
 plt.contour(
 xx,
 yy,
 Z,
 colors="k",
 levels=[-1, 0, 1],
 alpha=0.5,
 linestyles=["--", "-", "--"],
 )
 plt.scatter(
 support_vectors[:, 0],
 support_vectors[:, 1],
 s=100,
 linewidth=1,
 facecolors="none",
 edgecolors="k",
 )
plt.title("C=" + str(C))
plt.tight_layout()
plt.show()

3. Support Vector Regression

3.1 Parameters

class sklearn.svm.SVR(*, kernel=‘rbf’, degree=3, gamma=‘scale’, coef0=0.0, 
tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, 
max_iter=- 1)

SVC와 동일

3.2 Attributes

SVC와 동일

3.3 Methods

분류가 아니기에 prob를 구하는 method는 없음
나머진 동일

3.4 Implementation

😗 데이터 생성하기

import numpy as np
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel()
print(X[0:6], '\n\n',y[0:10])

[[0.52871606]
 [0.57116235]
 [0.66348438]
 [0.69907167]
 [0.85967932]
 [0.96984676]] 

 [0.50442513 0.54061027 0.61586575 0.64350738 0.7576333  0.82479908
 0.83998352 0.86173085 0.86618382 0.89379124]

# 타깃데이터에 노이즈 추가하기
y[::5] += 3 * (0.5 - np.random.rand(8))
print(y[0:10])

노이즈를 추가함으로 실제 환경과 비슷하게 함

# 커널별 회귀 모델 생성 및 학습
from sklearn.svm import SVR
svr_rbf = SVR(kernel="rbf", C=100, gamma=0.1, epsilon=0.1)
svr_lin = SVR(kernel="linear", C=100, gamma="auto")
svr_poly = SVR(kernel="poly", C=100, gamma="auto", degree=3, 
epsilon=0.1, coef0=1)
svr_rbf.fit(X, y)
svr_lin.fit(X, y)
svr_poly.fit(X, y)

# 커널별 모델 평가
rbf_pred=svr_rbf.predict(X)
lin_pred=svr_lin.predict(X)
poly_pred=svr_poly.predict(X)
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_squared_error
import pandas as pd
import numpy as np
preds = [rbf_pred, lin_pred, poly_pred]
kernel = ['Random_Forest', 'Linear', 'Polynomial']
evls = ['mse', 'rmse', 'mae']
results=pd.DataFrame(index=kernel,columns=evls)
for pred, nm in zip(preds, kernel):
 mse = mean_squared_error(y, pred)
 mae = mean_absolute_error(y, pred)
 rmse = np.sqrt(mse)
 
 results.loc[nm]['mse']=round(mse,2)
 results.loc[nm]['rmse']=round(rmse,2)
 results.loc[nm]['mae']=round(mae,2)
results

lw = 2
svrs = [svr_rbf, svr_lin, svr_poly]
kernel_label = ["RBF", "Linear", "Polynomial"]
model_color = ["m", "c", "g"]
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 10), sharey=True)
for ix, svr in enumerate(svrs):
   axes[ix].plot(
     X,
     svr.fit(X, y).predict(X),
     color=model_color[ix],
     lw=lw,
     label="{} model".format(kernel_label[ix]),
   )
   axes[ix].scatter(
     X[svr.support_],
     y[svr.support_],
     facecolor="none",
     edgecolor=model_color[ix],
     s=50,
     label="{} support vectors".format(kernel_label[ix]),
   )
   axes[ix].scatter(
     X[np.setdiff1d(np.arange(len(X)), svr.support_)],
     y[np.setdiff1d(np.arange(len(X)), svr.support_)],
     facecolor="none",
     edgecolor="k",
     s=50,
     label="other training data",
   )
   axes[ix].legend(
     loc="upper center",
     bbox_to_anchor=(0.5, 1.1),
     ncol=1,
     fancybox=True,
     shadow=True,
   )
fig.text(0.5, 0.04, "data", ha="center", va="center")
fig.text(0.06, 0.5, "target", ha="center", va="center", rotation="vertical")
fig.suptitle("Support Vector Regression", fontsize=14)
plt.show()

Twitter Facebook LinkedIn

sigi

ADP 실기 7장 Support Vector Machine

1. Support Vector Machine

1.1 Concpet

1.2 최대 마진 분류기

1.3 서포트 벡터 분류기

1.4 서포트 벡터 머신

1.5 서포트 벡터 회귀

2. Support Vector Classifier

2.1 Parameters

2.2 Attributes

2.3 Methods

2.4 Implementation

3. Support Vector Regression

3.1 Parameters

3.2 Attributes

3.3 Methods

3.4 Implementation

공유하기

댓글남기기

참고

Llama 3.1 한국어 Finetuning

LLM의 활용 방법

k8s (3) Pod 정리

Continuous distribution (3) Gamma Distribution