
This post introduces Naive Bayes.

1. Bayes' Theorem

1.1 Concept

A theorem that describes the relationship between the prior and posterior probabilities of two random variables.

1.2 Formula

P(A|B) = P(B|A) · P(A) / P(B)

  • Posterior: the probability of event A given that event B has occurred; usually the quantity a problem asks for
    • P(A|B)
  • Likelihood: the probability of event B given that event A has occurred
    • P(B|A)
  • Prior: the probability that event A occurs
    • P(A)
  • Evidence: the observation; the probability that event B occurs
    • P(B)
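
For example, here is a minimal numeric sketch of the formula in Python, using the classic diagnostic-test setting; every number below is made up for illustration:

prior = 0.01                  # P(A): base rate of the condition
likelihood = 0.9              # P(B|A): test is positive given the condition
false_positive = 0.05         # P(B|not A): test is positive without it

# Evidence via the law of total probability: P(B)
evidence = likelihood * prior + false_positive * (1 - prior)

# Posterior: P(A|B) = P(B|A) * P(A) / P(B)
posterior = likelihood * prior / evidence
print(posterior)  # ~0.154: one positive test lifts a 1% prior to ~15%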

2. Naive Bayes

2.1 Concept

Assuming that, given the class, each attribute is independent of the others, Naive Bayes measures how much each attribute value contributes to the class prediction.
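
Concretely, a class c is scored as P(c) times the product of the per-attribute likelihoods P(xᵢ|c). A minimal sketch with hypothetical likelihood tables (all names and numbers are made up for illustration):

priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {
    "spam": {"offer": 0.8, "meeting": 0.1},
    "ham":  {"offer": 0.2, "meeting": 0.6},
}

def score(c, words):
    s = priors[c]                # P(c)
    for w in words:
        s *= likelihoods[c][w]   # the "naive" independence step
    return s

words = ["offer", "meeting"]
scores = {c: score(c, words) for c in priors}
total = sum(scores.values())
print({c: s / total for c, s in scores.items()})  # normalized posteriors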

2.2 Conditions

  • When data is scarce: with few samples, purely frequentist estimates lose reliability
  • When the goal is prediction: rather than committing to a single fixed estimate, the estimate is revised toward a realistic value as evidence accumulates (see the sketch below)
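
A toy sketch of that updating process, where each posterior becomes the prior for the next observation (all numbers are illustrative):

prior = 0.5              # initial belief P(A)
likelihood = 0.9         # P(evidence | A)
false_positive = 0.2     # P(evidence | not A)

for step in range(3):    # observe the same kind of evidence three times
    evidence = likelihood * prior + false_positive * (1 - prior)
    prior = likelihood * prior / evidence   # posterior becomes next prior
    print(f"after observation {step + 1}: P(A) = {prior:.3f}")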

2.3 Kinds

  • BernoulliNB: binary (0/1) features (discrete)
  • MultinomialNB: count data (discrete)
  • GaussianNB: continuous features, assuming each follows a normal distribution (continuous)
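
A quick sketch of matching each variant to its expected feature type, on toy two-sample inputs (the data is made up for illustration):

from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

BernoulliNB().fit([[0, 1], [1, 0]], [0, 1])          # binary features
MultinomialNB().fit([[3, 0], [0, 2]], [0, 1])        # count features
GaussianNB().fit([[0.5, 1.2], [2.3, 0.7]], [0, 1])   # continuous features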

3. GaussianNB

3.1 Parameters

class sklearn.naive_bayes.GaussianNB(*, priors=None, var_smoothing=1e-09)
  • priors: prior probabilities of the classes
  • var_smoothing: portion of the largest feature variance added to all variances for numerical stability
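
For instance, priors can be pinned instead of estimated from the training data; a minimal sketch (the values below are hypothetical, one per class, and must sum to 1):

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB(priors=[0.5, 0.3, 0.2])  # hypothetical 3-class priors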

3.2 Methods

  • fit(X, y): fit the model on the training data
  • predict(X): predict class labels
  • predict_proba(X): return per-class probability estimates
  • score(X, y): return the mean accuracy on the given data

3.3 Implementation

Load the data

import pandas as pd
sky = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/Skyserver.csv")
sky.info()
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   objid      10000 non-null  float64
 1   ra         10000 non-null  float64
 2   dec        10000 non-null  float64
 3   u          10000 non-null  float64
 4   g          10000 non-null  float64
 5   r          10000 non-null  float64
 6   i          10000 non-null  float64
 7   z          10000 non-null  float64
 8   run        10000 non-null  int64  
 9   rerun      10000 non-null  int64  
 10  camcol     10000 non-null  int64  
 11  field      10000 non-null  int64  
 12  specobjid  10000 non-null  float64
 13  class      10000 non-null  object 
 14  redshift   10000 non-null  float64
 15  plate      10000 non-null  int64  
 16  mjd        10000 non-null  int64  
 17  fiberid    10000 non-null  int64  
dtypes: float64(10), int64(7), object(1)
sky['class'].unique()
array(['STAR', 'GALAXY', 'QSO'], dtype=object)
import seaborn as sns
sns.pairplot(hue='class', data=sky[['z', 'run', 'i', 'class']])

import numpy as np
features = list(sky.columns)
features.remove('class')
X = sky[features]
y = sky['class']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
(7000, 17) (7000,)
(3000, 17) (3000,)
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
pred = gnb.fit(x_train, y_train).predict(x_test)
print("Accuracy Score : ", gnb.score(x_test, y_test))
Accuracy Score :  0.799
gnb.predict_proba(x_test)[[0, 13, 68]]
array([[8.26737014e-01, 4.43137039e-02, 1.28949282e-01],
       [5.39851854e-05, 9.64092748e-02, 9.03536740e-01],
       [8.32868012e-01, 4.48282737e-02, 1.22303715e-01]])
gnb.predict(x_test)[[0, 13, 68]]
array(['GALAXY', 'STAR', 'GALAXY'], dtype='<U6')
from sklearn.metrics import classification_report
pred = gnb.predict(x_test)
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

      GALAXY       0.74      0.97      0.84      1499
         QSO       0.00      0.00      0.00       255
        STAR       0.91      0.75      0.83      1246

    accuracy                           0.80      3000
   macro avg       0.55      0.58      0.56      3000
weighted avg       0.75      0.80      0.76      3000

Note that the model never predicts the minority QSO class (precision and recall of 0.00), a weakness the headline accuracy of 0.80 hides.

4. BernoulliNB

4.1 Parameters

class sklearn.naive_bayes.BernoulliNB(*, alpha=1.0, fit_prior=True, class_prior=None)
  • alpha: additive (Laplace) smoothing parameter
  • fit_prior: whether to learn class prior probabilities from the data; if False, a uniform prior is used
  • class_prior: prior probabilities of the classes

4.2 Methods

  • Same as GaussianNB

4.3 Implementation

Load the data

import pandas as pd
spam = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/spam.csv", encoding='utf-8')
spam.isna().sum()
v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
spam = spam[['v1', 'v2']]
spam

spam['v1'].unique()
array(['ham', 'spam'], dtype=object)
import numpy as np
spam['label'] = np.where(spam['v1']=='spam', 1, 0)
spam

X = spam['v2']
y = spam['label']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
(3900,) (3900,)
(1672,) (1672,)
from sklearn.feature_extraction.text import CountVectorizer 
cv = CountVectorizer(binary=True)
x_traincv = cv.fit_transform(x_train)
x_traincv.shape
(3900, 7175)
  • Bernoulli Naive Bayes expects binary (1/0) features as input, so CountVectorizer is used
  • With binary=True, a feature is set to 1 if the word appears at least once in the email and 0 otherwise (see the toy comparison below)
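
A toy comparison on two made-up strings shows what binary=True changes:

from sklearn.feature_extraction.text import CountVectorizer
docs = ["free free prize", "meeting at noon"]
counts = CountVectorizer().fit_transform(docs).toarray()
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()
print(counts)  # the "free" column holds 2 in the first row
print(binary)  # the same column holds 1: presence only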
encoded_input = x_traincv.toarray()
encoded_input
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
print(cv.inverse_transform(encoded_input[[0]]))
[array(['couple', 'down', 'give', 'me', 'minutes', 'my', 'sure', 'to',
       'track', 'wallet', 'yeah'], dtype='<U34')]
  • inverse_transform: shows which words the encoded email vector contains
print(cv.get_feature_names_out()[1000:1010], end='')
['at' 'ate' 'athletic' 'athome' 'atlanta' 'atlast' 'atm' 'attach'
 'attached' 'attack']
  • get_feature_names_out: shows which word each vector index corresponds to
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
bnb.fit(x_traincv, y_train)
x_testcv = cv.transform(x_test)
pred = bnb.predict(x_testcv)
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)
print("Accuracy Score : ", acc)
Accuracy Score :  0.9754784688995215
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.97      1.00      0.99      1448
           1       0.99      0.82      0.90       224

    accuracy                           0.98      1672
   macro avg       0.98      0.91      0.94      1672
weighted avg       0.98      0.98      0.97      1672

5. MultinomialNB

5.1 Parameters

class sklearn.naive_bayes.MultinomialNB(*, alpha=1.0, fit_prior=True, class_prior=None)
  • Same as above

5.2 Methods

  • Same as above

5.3 Implementation

Load the data

from keras.datasets import imdb
(X_train, y_train), (X_test, y_test) = imdb.load_data()
print(X_train.shape)
print(X_test.shape)
(25000,)
(25000,)
import pandas as pd

# imdb.load_data() offsets every word index by 3, reserving 0, 1, and 2
# for the padding, sequence-start, and unknown tokens.
word_to_index = imdb.get_word_index()
index_to_word = {}
for key, value in word_to_index.items():
    index_to_word[value + 3] = key

for index, token in enumerate(("<pad>", "<sos>", "<unk>")):
    index_to_word[index] = token

# Decode each integer-encoded review back into a plain-text string.
train_reviews = []
for X in X_train:
    tmp = ' '.join([index_to_word[index] for index in X])
    train_reviews.append(tmp)

test_reviews = []
for X in X_test:
    tmp = ' '.join([index_to_word[index] for index in X])
    test_reviews.append(tmp)

train = pd.concat([pd.DataFrame(train_reviews), pd.DataFrame(y_train)], axis=1)
train.columns = ['reviews', 'label']
train['reviews'] = train['reviews'].str[6:]  # drop the leading "<sos> "
test = pd.concat([pd.DataFrame(test_reviews), pd.DataFrame(y_test)], axis=1)
test.columns = ['reviews', 'label']
test['reviews'] = test['reviews'].str[6:]    # drop the leading "<sos> "
print("<<<<<<<<< Train Dataset for MNB >>>>>>>>>")
train.head()

print("<<<<<<<<< Test Dataset for MNB >>>>>>>>>")
test.head()

x_train, x_test = train['reviews'].values, test['reviews'].values
y_train, y_test = train['label'].values, test['label'].values
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
(25000,) (25000,)
(25000,) (25000,)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary=False)
x_traincv = cv.fit_transform(x_train)
x_traincv.shape
(25000, 76521)
print(cv.inverse_transform(x_traincv)[0])
['ilm' 'was' 'just' 'brilliant' 'casting' 'location' 'scenery' 'story'
 'direction' 'everyone' 'really' 'suited' 'the' 'part' 'they' 'played'
 'and' 'you' 'could' 'imagine' 'being' 'there' 'robert' 'redford' 'is'
 'an' 'amazing' 'actor' 'now' 'same' 'director' 'norman' 'father' 'came'
 'from' 'scottish' 'island' 'as' 'myself' 'so' 'loved' 'fact' 'real'
 'connection' 'with' 'this' 'film' 'witty' 'remarks' 'throughout' 'were'
 'great' 'it' 'much' 'that' 'bought' 'soon' 'released' 'for' 'retail'
 'would' 'recommend' 'to' 'watch' 'fly' 'fishing' 'cried' 'at' 'end' 'sad'
 'know' 'what' 'say' 'if' 'cry' 'must' 'have' 'been' 'good' 'definitely'
 'also' 'congratulations' 'two' 'little' 'boy' 'of' 'paul' 'children'
 'are' 'often' 'left' 'out' 'praising' 'list' 'think' 'because' 'stars'
 'play' 'them' 'all' 'grown' 'up' 'such' 'big' 'profile' 'whole' 'but'
 'these' 'should' 'be' 'praised' 'done' 'don' 'lovely' 'true' 'someone'
 'life' 'after' 'shared' 'us']
print(cv.get_feature_names_out()[-10:])
['était' 'état' 'étc' 'évery' 'êxtase' 'ís' 'ísnt' 'østbye' 'über'
 'üvegtigris']
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(x_traincv, y_train)
from sklearn.metrics import accuracy_score, classification_report
x_testcv = cv.transform(x_test)
pred = mnb.predict(x_testcv)
acc = accuracy_score(y_test, pred)
print("Accuracy Score : ", acc)
Accuracy Score :  0.81932
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.79      0.87      0.83     12500
           1       0.85      0.77      0.81     12500

    accuracy                           0.82     25000
   macro avg       0.82      0.82      0.82     25000
weighted avg       0.82      0.82      0.82     25000
