ADP Practical Exam, Chapter 11: Naive Bayes
This post introduces Naive Bayes.
1. Bayes' Theorem
1.1 Concept
A theorem that describes the relationship between the prior and posterior probabilities of two random variables.
1.2 Formula
- Posterior: the probability that event A occurs given that event B has occurred; usually the quantity a problem asks for
- P(A|B)
- Likelihood: the probability that event B occurs given that event A has occurred
- P(B|A)
- Prior: the probability that event A occurs
- P(A)
- Evidence: the observation; the probability that event B occurs
- P(B)
- Putting these together: P(A|B) = P(B|A)P(A) / P(B)
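The pieces above combine into Bayes' theorem, which can be checked with a small numeric example. The numbers below (1% prevalence, 90% true-positive rate, 5% false-positive rate) are made-up illustrative values, not from the text:

```python
# Disease/test example: A = has disease, B = test positive
p_a = 0.01              # prior P(A)
p_b_given_a = 0.9       # likelihood P(B|A)
p_b_given_not_a = 0.05  # false-positive rate P(B|~A)

# Evidence via the law of total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior: P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))  # 0.1538
```

Even with a 90% accurate test, the low prior keeps the posterior around 15%, which is why the prior term matters.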
2. Naive Bayes
2.1 Concept
Assuming that the attributes are independent of one another given the class, measures the effect of each attribute value on the class assignment.
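Under this independence assumption, the class-conditional likelihood factorizes into a product of per-attribute likelihoods. A hand-rolled sketch with made-up spam/ham word probabilities (all numbers are illustrative assumptions):

```python
# Naive Bayes factorization: P(C | x1, x2) ∝ P(C) * P(x1|C) * P(x2|C)
priors = {"spam": 0.4, "ham": 0.6}
p_word_given_class = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham":  {"free": 0.2, "meeting": 0.7},
}

words = ["free", "meeting"]
scores = {}
for c in priors:
    score = priors[c]
    for w in words:  # independence: multiply the per-word likelihoods
        score *= p_word_given_class[c][w]
    scores[c] = score

# Normalize the unnormalized scores to get posteriors
total = sum(scores.values())
posterior = {c: s / total for c, s in scores.items()}
print(posterior)
```

The scikit-learn estimators below implement exactly this product, differing only in how P(x|C) is modeled per feature type.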
2.2 Conditions
- When data are scarce: with little data, the reliability of frequentist statistical techniques drops
- When the goal is prediction: rather than committing to a single estimate, the estimate is revised toward a realistic value as evidence accumulates
2.3 Kinds
- BernoulliNB: binary features (discrete)
- MultinomialNB: count data (discrete)
- GaussianNB: applied under the assumption that the data are continuous and normally distributed (continuous)
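A minimal sketch of matching each variant to its expected input type, on randomly generated toy data (the arrays and labels below are illustrative, not from the post):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

rng = np.random.RandomState(0)
y = np.array([0] * 10 + [1] * 10)

# BernoulliNB expects binary (0/1) features
X_bin = rng.randint(0, 2, size=(20, 5))
pred_b = BernoulliNB().fit(X_bin, y).predict(X_bin)

# MultinomialNB expects non-negative count features
X_cnt = rng.poisson(3.0, size=(20, 5))
pred_m = MultinomialNB().fit(X_cnt, y).predict(X_cnt)

# GaussianNB expects continuous features, modeled as normal per class
X_con = rng.normal(size=(20, 5))
pred_g = GaussianNB().fit(X_con, y).predict(X_con)

print(pred_b[:5], pred_m[:5], pred_g[:5])
```

All three share the same fit/predict interface; only the per-feature likelihood model differs.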
3. GaussianNB
3.1 Parameters
class sklearn.naive_bayes.GaussianNB(*, priors=None, var_smoothing=1e-09)
- priors: prior probabilities of the classes
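A quick sketch of what priors does, on a tiny made-up dataset: left at None, the priors are estimated from the class frequencies in y; an explicit array overrides them.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[0.0], [0.2], [0.1], [3.0], [3.2], [2.9]])
y = np.array([0, 0, 0, 1, 1, 1])

# Default: priors estimated from y (3 samples per class -> [0.5, 0.5])
gnb = GaussianNB().fit(X, y)
print(gnb.class_prior_)

# Explicit priors override the empirical class frequencies
gnb_skew = GaussianNB(priors=[0.9, 0.1]).fit(X, y)
print(gnb_skew.class_prior_)
```

Supplying priors is useful when the training sample's class balance is known not to match the population's.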
3.2 Methods
- fit(X, y)
- predict(X)
- predict_proba(X)
- score(X, y)
3.3 Implementation
Load the data
import pandas as pd
sky = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/Skyserver.csv")
sky.info()
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   objid      10000 non-null  float64
 1   ra         10000 non-null  float64
 2   dec        10000 non-null  float64
 3   u          10000 non-null  float64
 4   g          10000 non-null  float64
 5   r          10000 non-null  float64
 6   i          10000 non-null  float64
 7   z          10000 non-null  float64
 8   run        10000 non-null  int64
 9   rerun      10000 non-null  int64
 10  camcol     10000 non-null  int64
 11  field      10000 non-null  int64
 12  specobjid  10000 non-null  float64
 13  class      10000 non-null  object
 14  redshift   10000 non-null  float64
 15  plate      10000 non-null  int64
 16  mjd        10000 non-null  int64
 17  fiberid    10000 non-null  int64
dtypes: float64(10), int64(7), object(1)
sky['class'].unique()
array(['STAR', 'GALAXY', 'QSO'], dtype=object)
import seaborn as sns
sns.pairplot(hue='class', data=sky[['z', 'run', 'i', 'class']])
import numpy as np
features = list(sky.columns)
features.remove('class')
X = sky[features]
y = sky['class']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1, stratify=y)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
(7000, 17) (7000,)
(3000, 17) (3000,)
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
pred = gnb.fit(x_train, y_train).predict(x_test)
print("Accuracy Score : ", gnb.score(x_test, y_test))
Accuracy Score : 0.799
gnb.predict_proba(x_test)[[0, 13, 68]]
array([[8.26737014e-01, 4.43137039e-02, 1.28949282e-01],
       [5.39851854e-05, 9.64092748e-02, 9.03536740e-01],
       [8.32868012e-01, 4.48282737e-02, 1.22303715e-01]])
gnb.predict(x_test)[[0, 13, 68]]
array(['GALAXY', 'STAR', 'GALAXY'], dtype='<U6')
from sklearn.metrics import classification_report
pred = gnb.predict(x_test)
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

      GALAXY       0.74      0.97      0.84      1499
         QSO       0.00      0.00      0.00       255
        STAR       0.91      0.75      0.83      1246

    accuracy                           0.80      3000
   macro avg       0.55      0.58      0.56      3000
weighted avg       0.75      0.80      0.76      3000
4. BernoulliNB
4.1 Parameters
class sklearn.naive_bayes.BernoulliNB(*, alpha=1.0, fit_prior=True, class_prior=None)
- fit_prior: whether to learn class prior probabilities; if False, a uniform prior is used
- class_prior: prior probabilities of the classes
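A small sketch of fit_prior on a made-up imbalanced dataset: the learned prior follows the 6:2 class split, while fit_prior=False forces a uniform prior (class_log_prior_ stores the priors on a log scale):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[1, 0], [1, 1], [0, 1], [1, 0],
              [1, 1], [0, 0], [1, 1], [0, 1]])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])   # imbalanced: 6 vs 2

# fit_prior=True (default): priors learned from class frequencies
learned = np.exp(BernoulliNB().fit(X, y).class_log_prior_)
print(learned)   # [0.75 0.25]

# fit_prior=False: uniform prior regardless of class frequencies
uniform = np.exp(BernoulliNB(fit_prior=False).fit(X, y).class_log_prior_)
print(uniform)   # [0.5 0.5]
```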
4.2 Methods
- Same as GaussianNB
4.3 Implementation
Load the data
import pandas as pd
spam = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/spam.csv", encoding='utf-8')
spam.isna().sum()
v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
spam = spam[['v1', 'v2']]
spam
spam['v1'].unique()
array(['ham', 'spam'], dtype=object)
import numpy as np
spam['label'] = np.where(spam['v1']=='spam', 1, 0)
spam
X = spam['v2']
y = spam['label']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1, stratify=y)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
(3900,) (3900,)
(1672,) (1672,)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary=True)
x_traincv = cv.fit_transform(x_train)
x_traincv.shape
(3900, 7175)
- Bernoulli Naive Bayes takes binary (0/1) input, so CountVectorizer is used to encode the text
- With binary=True, a word is encoded as 1 if it appears at least once in an email and 0 otherwise
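The effect of binary can be seen on a two-document toy corpus (the sentences are made up): the default binary=False returns raw counts, while binary=True keeps only presence/absence.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free free prize", "meeting at noon"]
# Vocabulary is sorted alphabetically: ['at', 'free', 'meeting', 'noon', 'prize']

# binary=False (default): raw term counts per document
counts = CountVectorizer(binary=False).fit_transform(docs).toarray()
print(counts)   # "free" counted twice in the first document

# binary=True: presence/absence only, as BernoulliNB expects
onehot = CountVectorizer(binary=True).fit_transform(docs).toarray()
print(onehot)   # the repeated "free" collapses to 1
```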
encoded_input = x_traincv.toarray()
encoded_input
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
print(cv.inverse_transform(encoded_input[[0]]))
[array(['couple', 'down', 'give', 'me', 'minutes', 'my', 'sure', 'to',
        'track', 'wallet', 'yeah'], dtype='<U34')]
- inverse_transform: shows which words are present in an email title that has been encoded as a vector
print(cv.get_feature_names_out()[1000:1010], end='')
['at' 'ate' 'athletic' 'athome' 'atlanta' 'atlast' 'atm' 'attach'
'attached' 'attack']
- get_feature_names_out: shows which word each index of the vector corresponds to
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
bnb.fit(x_traincv, y_train)
x_testcv = cv.transform(x_test)
pred = bnb.predict(x_testcv)
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)
print("Accuracy Score : ", acc)
Accuracy Score : 0.9754784688995215
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.97      1.00      0.99      1448
           1       0.99      0.82      0.90       224

    accuracy                           0.98      1672
   macro avg       0.98      0.91      0.94      1672
weighted avg       0.98      0.98      0.97      1672
5. MultinomialNB
5.1 Parameters
class sklearn.naive_bayes.MultinomialNB(*, alpha=1.0, fit_prior=True, class_prior=None)
- Same as above
5.2 Methods
- Same as above
5.3 Implementation
Load the data
from keras.datasets import imdb
(X_train, y_train), (X_test, y_test) = imdb.load_data()
print(X_train.shape)
print(X_test.shape)
(25000,)
(25000,)
import pandas as pd
word_to_index = imdb.get_word_index()
index_to_word = {}
for key, value in word_to_index.items():
    index_to_word[value + 3] = key
for index, token in enumerate(("<pad>", "<sos>", "<unk>")):
    index_to_word[index] = token
train_reviews = []
for X in X_train:
    tmp = ' '.join([index_to_word[index] for index in X])
    train_reviews.append(tmp)
test_reviews = []
for X in X_test:
    tmp = ' '.join([index_to_word[index] for index in X])
    test_reviews.append(tmp)
train = pd.concat([pd.DataFrame(train_reviews), pd.DataFrame(y_train)], axis=1)
train.columns = ['reviews', 'label']
train['reviews'] = train['reviews'].str[6:]  # strip the leading "<sos> " token
test = pd.concat([pd.DataFrame(test_reviews), pd.DataFrame(y_test)], axis=1)
test.columns = ['reviews', 'label']
test['reviews'] = test['reviews'].str[6:]
print("<<<<<<<<< Train Dataset for MNB >>>>>>>>>")
train.head()
print("<<<<<<<<< Test Dataset for MNB >>>>>>>>>")
test.head()
x_train, x_test = train['reviews'].values, test['reviews'].values
y_train, y_test = train['label'].values, test['label'].values
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
(25000,) (25000,)
(25000,) (25000,)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary=False)
x_traincv = cv.fit_transform(x_train)
x_traincv.shape
(25000, 76521)
print(cv.inverse_transform(x_traincv)[0])
['ilm' 'was' 'just' 'brilliant' 'casting' 'location' 'scenery' 'story'
'direction' 'everyone' 'really' 'suited' 'the' 'part' 'they' 'played'
'and' 'you' 'could' 'imagine' 'being' 'there' 'robert' 'redford' 'is'
'an' 'amazing' 'actor' 'now' 'same' 'director' 'norman' 'father' 'came'
'from' 'scottish' 'island' 'as' 'myself' 'so' 'loved' 'fact' 'real'
'connection' 'with' 'this' 'film' 'witty' 'remarks' 'throughout' 'were'
'great' 'it' 'much' 'that' 'bought' 'soon' 'released' 'for' 'retail'
'would' 'recommend' 'to' 'watch' 'fly' 'fishing' 'cried' 'at' 'end' 'sad'
'know' 'what' 'say' 'if' 'cry' 'must' 'have' 'been' 'good' 'definitely'
'also' 'congratulations' 'two' 'little' 'boy' 'of' 'paul' 'children'
'are' 'often' 'left' 'out' 'praising' 'list' 'think' 'because' 'stars'
'play' 'them' 'all' 'grown' 'up' 'such' 'big' 'profile' 'whole' 'but'
'these' 'should' 'be' 'praised' 'done' 'don' 'lovely' 'true' 'someone'
'life' 'after' 'shared' 'us']
print(cv.get_feature_names_out()[-10:])
['était' 'état' 'étc' 'évery' 'êxtase' 'ís' 'ísnt' 'østbye' 'über'
'üvegtigris']
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(x_traincv, y_train)
from sklearn.metrics import accuracy_score, classification_report
x_testcv = cv.transform(x_test)
pred = mnb.predict(x_testcv)
acc = accuracy_score(y_test, pred)
print("Accuracy Score : ", acc)
Accuracy Score : 0.81932
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.79      0.87      0.83     12500
           1       0.85      0.77      0.81     12500

    accuracy                           0.82     25000
   macro avg       0.82      0.82      0.82     25000
weighted avg       0.82      0.82      0.82     25000