
This post introduces Naive Bayes.

1. Bayes' Theorem

1.1 Concept

A theorem that describes the relationship between the prior and posterior probabilities of two random variables.

1.2 Formula

P(A|B) = P(B|A) · P(A) / P(B)

  • Posterior: the probability of event A given that event B has occurred; usually the quantity a problem asks for
    • P(A|B)
  • Likelihood: the probability of event B given that event A has occurred
    • P(B|A)
  • Prior: the probability that event A occurs
    • P(A)
  • Evidence: the observation; the probability that event B occurs
    • P(B)
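
For example, here is a minimal numeric sketch of the formula in Python, using the classic diagnostic-test setting; every number below is made up for illustration:

prior = 0.01                  # P(A): base rate of the condition
likelihood = 0.9              # P(B|A): test is positive given the condition
false_positive = 0.05         # P(B|not A): test is positive without it

# Evidence via the law of total probability: P(B)
evidence = likelihood * prior + false_positive * (1 - prior)

# Posterior: P(A|B) = P(B|A) * P(A) / P(B)
posterior = likelihood * prior / evidence
print(posterior)  # ~0.154: one positive test lifts a 1% prior to ~15%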

2. Naive Bayes

2.1 Concept

Assuming that, given the class, each attribute is independent of the others, Naive Bayes measures how much each attribute value contributes to the class prediction.
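
Concretely, a class c is scored as P(c) times the product of the per-attribute likelihoods P(xᵢ|c). A minimal sketch with hypothetical likelihood tables (all names and numbers are made up for illustration):

priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {
    "spam": {"offer": 0.8, "meeting": 0.1},
    "ham":  {"offer": 0.2, "meeting": 0.6},
}

def score(c, words):
    s = priors[c]                # P(c)
    for w in words:
        s *= likelihoods[c][w]   # the "naive" independence step
    return s

words = ["offer", "meeting"]
scores = {c: score(c, words) for c in priors}
total = sum(scores.values())
print({c: s / total for c, s in scores.items()})  # normalized posteriors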

2.2 Conditions

  • When data is scarce: with few samples, purely frequentist estimates lose reliability
  • When the goal is prediction: rather than committing to a single fixed estimate, the estimate is revised toward a realistic value as evidence accumulates (see the sketch below)
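
A toy sketch of that updating process, where each posterior becomes the prior for the next observation (all numbers are illustrative):

prior = 0.5              # initial belief P(A)
likelihood = 0.9         # P(evidence | A)
false_positive = 0.2     # P(evidence | not A)

for step in range(3):    # observe the same kind of evidence three times
    evidence = likelihood * prior + false_positive * (1 - prior)
    prior = likelihood * prior / evidence   # posterior becomes next prior
    print(f"after observation {step + 1}: P(A) = {prior:.3f}")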

2.3 Kinds

  • BernoulliNB: binary (0/1) features (discrete)
  • MultinomialNB: count data (discrete)
  • GaussianNB: continuous features, assuming each follows a normal distribution (continuous)
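
A quick sketch of matching each variant to its expected feature type, on toy two-sample inputs (the data is made up for illustration):

from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

BernoulliNB().fit([[0, 1], [1, 0]], [0, 1])          # binary features
MultinomialNB().fit([[3, 0], [0, 2]], [0, 1])        # count features
GaussianNB().fit([[0.5, 1.2], [2.3, 0.7]], [0, 1])   # continuous features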

3. GaussianNB

3.1 Parameters

class sklearn.naive_bayes.GaussianNB(*, priors=None, var_smoothing=1e-09)
  • priors: prior probabilities of the classes
  • var_smoothing: portion of the largest feature variance added to all variances for numerical stability
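
For instance, priors can be pinned instead of estimated from the training data; a minimal sketch (the values below are hypothetical, one per class, and must sum to 1):

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB(priors=[0.5, 0.3, 0.2])  # hypothetical 3-class priors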

3.2 Methods

  • fit(X, y): fit the model on the training data
  • predict(X): predict class labels
  • predict_proba(X): return per-class probability estimates
  • score(X, y): return the mean accuracy on the given data

3.3 Implementation

Load the data

import pandas as pd
sky = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/Skyserver.csv")
sky.info()
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   objid      10000 non-null  float64
 1   ra         10000 non-null  float64
 2   dec        10000 non-null  float64
 3   u          10000 non-null  float64
 4   g          10000 non-null  float64
 5   r          10000 non-null  float64
 6   i          10000 non-null  float64
 7   z          10000 non-null  float64
 8   run        10000 non-null  int64  
 9   rerun      10000 non-null  int64  
 10  camcol     10000 non-null  int64  
 11  field      10000 non-null  int64  
 12  specobjid  10000 non-null  float64
 13  class      10000 non-null  object 
 14  redshift   10000 non-null  float64
 15  plate      10000 non-null  int64  
 16  mjd        10000 non-null  int64  
 17  fiberid    10000 non-null  int64  
dtypes: float64(10), int64(7), object(1)
sky['class'].unique()
array(['STAR', 'GALAXY', 'QSO'], dtype=object)
import seaborn as sns
sns.pairplot(hue='class', data=sky[['z', 'run', 'i', 'class']])

import numpy as np
features = list(sky.columns)
features.remove('class')
X = sky[features]
y = sky['class']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
(7000, 17) (7000,)
(3000, 17) (3000,)
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
pred = gnb.fit(x_train, y_train).predict(x_test)
print("Accuracy Score : ", gnb.score(x_test, y_test))
Accuracy Score :  0.799
gnb.predict_proba(x_test)[[0, 13, 68]]
array([[8.26737014e-01, 4.43137039e-02, 1.28949282e-01],
       [5.39851854e-05, 9.64092748e-02, 9.03536740e-01],
       [8.32868012e-01, 4.48282737e-02, 1.22303715e-01]])
gnb.predict(x_test)[[0, 13, 68]]
array(['GALAXY', 'STAR', 'GALAXY'], dtype='<U6')
from sklearn.metrics import classification_report
pred = gnb.predict(x_test)
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

      GALAXY       0.74      0.97      0.84      1499
         QSO       0.00      0.00      0.00       255
        STAR       0.91      0.75      0.83      1246

    accuracy                           0.80      3000
   macro avg       0.55      0.58      0.56      3000
weighted avg       0.75      0.80      0.76      3000

Note that the model never predicts the minority QSO class (precision and recall of 0.00), a weakness the headline accuracy of 0.80 hides.

4. BernoulliNB

4.1 Parameters

class sklearn.naive_bayes.BernoulliNB(*, alpha=1.0, fit_prior=True, class_prior=None)
  • alpha: additive (Laplace) smoothing parameter
  • fit_prior: whether to learn class prior probabilities from the data; if False, a uniform prior is used
  • class_prior: prior probabilities of the classes

4.2 Methods

  • Same as GaussianNB

4.3 Implementation

Load the data

import pandas as pd
spam = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/spam.csv", encoding='utf-8')
spam.isna().sum()
v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
spam = spam[['v1', 'v2']]
spam

spam['v1'].unique()
array(['ham', 'spam'], dtype=object)
import numpy as np
spam['label'] = np.where(spam['v1']=='spam', 1, 0)
spam

X = spam['v2']
y = spam['label']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
(3900,) (3900,)
(1672,) (1672,)
from sklearn.feature_extraction.text import CountVectorizer 
cv = CountVectorizer(binary=True)
x_traincv = cv.fit_transform(x_train)
x_traincv.shape
(3900, 7175)
  • Bernoulli Naive Bayes expects binary (1/0) features as input, so CountVectorizer is used
  • With binary=True, a feature is set to 1 if the word appears at least once in the email and 0 otherwise (see the toy comparison below)
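
A toy comparison on two made-up strings shows what binary=True changes:

from sklearn.feature_extraction.text import CountVectorizer
docs = ["free free prize", "meeting at noon"]
counts = CountVectorizer().fit_transform(docs).toarray()
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()
print(counts)  # the "free" column holds 2 in the first row
print(binary)  # the same column holds 1: presence only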
encoded_input = x_traincv.toarray()
encoded_input
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
print(cv.inverse_transform(encoded_input[[0]]))
[array(['couple', 'down', 'give', 'me', 'minutes', 'my', 'sure', 'to',
       'track', 'wallet', 'yeah'], dtype='<U34')]
  • inverse_transform: shows which words the encoded email vector contains
print(cv.get_feature_names_out()[1000:1010], end='')
['at' 'ate' 'athletic' 'athome' 'atlanta' 'atlast' 'atm' 'attach'
 'attached' 'attack']
  • get_feature_names_out: shows which word each vector index corresponds to
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
bnb.fit(x_traincv, y_train)
x_testcv = cv.transform(x_test)
pred = bnb.predict(x_testcv)
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)
print("Accuracy Score : ", acc)
Accuracy Score :  0.9754784688995215
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.97      1.00      0.99      1448
           1       0.99      0.82      0.90       224

    accuracy                           0.98      1672
   macro avg       0.98      0.91      0.94      1672
weighted avg       0.98      0.98      0.97      1672

5. MultinomialNB

5.1 Parameters

class sklearn.naive_bayes.MultinomialNB(*, alpha=1.0, fit_prior=True, class_prior=None)
  • Same as above

5.2 Methods

  • Same as above

5.3 Implementation

Load the data

from keras.datasets import imdb
(X_train, y_train), (X_test, y_test) = imdb.load_data()
print(X_train.shape)
print(X_test.shape)
(25000,)
(25000,)
import pandas as pd

# imdb.load_data() offsets every word index by 3, reserving 0, 1, and 2
# for the padding, sequence-start, and unknown tokens.
word_to_index = imdb.get_word_index()
index_to_word = {}
for key, value in word_to_index.items():
    index_to_word[value + 3] = key

for index, token in enumerate(("<pad>", "<sos>", "<unk>")):
    index_to_word[index] = token

# Decode each integer-encoded review back into a plain-text string.
train_reviews = []
for X in X_train:
    tmp = ' '.join([index_to_word[index] for index in X])
    train_reviews.append(tmp)

test_reviews = []
for X in X_test:
    tmp = ' '.join([index_to_word[index] for index in X])
    test_reviews.append(tmp)

train = pd.concat([pd.DataFrame(train_reviews), pd.DataFrame(y_train)], axis=1)
train.columns = ['reviews', 'label']
train['reviews'] = train['reviews'].str[6:]  # drop the leading "<sos> "
test = pd.concat([pd.DataFrame(test_reviews), pd.DataFrame(y_test)], axis=1)
test.columns = ['reviews', 'label']
test['reviews'] = test['reviews'].str[6:]    # drop the leading "<sos> "
print("<<<<<<<<< Train Dataset for MNB >>>>>>>>>")
train.head()

print("<<<<<<<<< Test Dataset for MNB >>>>>>>>>")
test.head()

x_train, x_test = train['reviews'].values, test['reviews'].values
y_train, y_test = train['label'].values, test['label'].values
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
(25000,) (25000,)
(25000,) (25000,)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary=False)
x_traincv = cv.fit_transform(x_train)
x_traincv.shape
(25000, 76521)
print(cv.inverse_transform(x_traincv)[0])
['ilm' 'was' 'just' 'brilliant' 'casting' 'location' 'scenery' 'story'
 'direction' 'everyone' 'really' 'suited' 'the' 'part' 'they' 'played'
 'and' 'you' 'could' 'imagine' 'being' 'there' 'robert' 'redford' 'is'
 'an' 'amazing' 'actor' 'now' 'same' 'director' 'norman' 'father' 'came'
 'from' 'scottish' 'island' 'as' 'myself' 'so' 'loved' 'fact' 'real'
 'connection' 'with' 'this' 'film' 'witty' 'remarks' 'throughout' 'were'
 'great' 'it' 'much' 'that' 'bought' 'soon' 'released' 'for' 'retail'
 'would' 'recommend' 'to' 'watch' 'fly' 'fishing' 'cried' 'at' 'end' 'sad'
 'know' 'what' 'say' 'if' 'cry' 'must' 'have' 'been' 'good' 'definitely'
 'also' 'congratulations' 'two' 'little' 'boy' 'of' 'paul' 'children'
 'are' 'often' 'left' 'out' 'praising' 'list' 'think' 'because' 'stars'
 'play' 'them' 'all' 'grown' 'up' 'such' 'big' 'profile' 'whole' 'but'
 'these' 'should' 'be' 'praised' 'done' 'don' 'lovely' 'true' 'someone'
 'life' 'after' 'shared' 'us']
print(cv.get_feature_names_out()[-10:])
['était' 'état' 'étc' 'évery' 'êxtase' 'ís' 'ísnt' 'østbye' 'über'
 'üvegtigris']
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(x_traincv, y_train)
from sklearn.metrics import accuracy_score, classification_report
x_testcv = cv.transform(x_test)
pred = mnb.predict(x_testcv)
acc = accuracy_score(y_test, pred)
print("Accuracy Score : ", acc)
Accuracy Score :  0.81932
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.79      0.87      0.83     12500
           1       0.85      0.77      0.81     12500

    accuracy                           0.82     25000
   macro avg       0.82      0.82      0.82     25000
weighted avg       0.82      0.82      0.82     25000
