
[ML] HA1 part1 (3) Gaussian discriminant analysis (Gaussian Naive Bayes) with iris dataset

당도최고치악산멜론 2022. 12. 19. 14:34

📢 This post is an assignment I did for a school course.

 

In this post, we will fit a Gaussian Discriminant Analysis (Gaussian Naive Bayes) model to the iris dataset.

Google Colab was used; refer to the first post in this series for the full code.

 

* Previous posts

[ML] HA1 part1 (1) Linear Regression with Startup dataset

https://kwonppo.tistory.com/34

 


[ML] HA1 part1 (2) Logistic regression with Titanic dataset

https://kwonppo.tistory.com/35

 



3. Gaussian discriminant analysis (Gaussian Naive Bayes) with iris dataset
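Before the code, a quick summary of the model we are about to implement (the notation here is mine, not from the assignment): Gaussian Naive Bayes treats the features as conditionally independent given the class, models each feature with a per-class Gaussian, and scores a sample x by the class prior times the product of the per-feature densities:

p(y = c \mid x) \;\propto\; p(y = c)\,\prod_{i=1}^{4} \mathcal{N}(x_i;\, \mu_{c,i},\, \sigma_{c,i}^2), \qquad \mathcal{N}(x;\, \mu,\, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

The code below estimates \mu_{c,i} and \sigma_{c,i} per class from the train set and predicts the class with the highest score.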

First, load the iris.csv dataset.

!wget https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv

# imports used throughout this post (pandas to load the CSV, numpy later for the arrays)
import pandas as pd
import numpy as np

df3 = pd.read_csv('iris.csv', names=["SepalLength","SepalWidth","PetalLength","PetalWidth","Species"])
df3.head()
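For reference, the first rows of this file are the well-known first samples of the iris dataset (all Iris-setosa), so df3.head() should print roughly the following:

   SepalLength  SepalWidth  PetalLength  PetalWidth      Species
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa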

This shows that the dataset consists of four numeric features (SepalLength, SepalWidth, PetalLength, PetalWidth) plus the Species label.

 

Next, we preprocess the data.

# cast the four measurement columns to float
df3['SepalLength'] = df3['SepalLength'].astype(float)
df3['SepalWidth'] = df3['SepalWidth'].astype(float)
df3['PetalLength'] = df3['PetalLength'].astype(float)
df3['PetalWidth'] = df3['PetalWidth'].astype(float)

# encode the class label as an integer: Iris-virginica -> 0, Iris-versicolor -> 1, Iris-setosa -> 2
df3['Species'] = df3['Species'].apply(lambda x: 0 if x == "Iris-virginica" else (1 if x == "Iris-versicolor" else 2))

# features and target
X = df3[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]
y = df3['Species']
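A quick way to sanity-check the label encoding (a small check added here, not part of the assignment code) is to count the samples per encoded class; iris has 50 samples of each species:

# each of the three encoded classes should appear 50 times
print(df3['Species'].value_counts())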

Now let's use GDA to train a model that predicts Species from the four features SepalLength, SepalWidth, PetalLength, and PetalWidth.

 

First, split the dataset into a train set and a test set.

from sklearn.model_selection import train_test_split

# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
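With 150 samples, test_size=0.25 leaves 112 samples for training and 38 for testing; a quick shape check confirms the split:

# expect (112, 4) (38, 4): a 75% / 25% split of 150 samples
print(x_train.shape, x_test.shape)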

Using the train set, let's compute the mean and standard deviation of each feature within each class.

# Split the dataset by class value; returns a dictionary mapping class -> list of feature vectors
def separate_by_class(x, y):
  data_dict = dict()
  for i in range(len(x)):
    vector = x[i]
    class_value = y[i]
    if class_value not in data_dict:
      data_dict[class_value] = list()
    data_dict[class_value].append(vector)
  return data_dict

from math import sqrt

# compute the mean of a list of numbers
def mean(numbers):
  return sum(numbers) / float(len(numbers))

# compute the (sample) standard deviation of a list of numbers
def stdev(numbers):
  avg = mean(numbers)
  variance = sum([(x - avg)**2 for x in numbers]) / float(len(numbers) - 1)
  return sqrt(variance)

# calculate the mean, stdev and count for each column in a dataset
def mean_std_for_all(dataset):
  # the rows here contain only the four feature columns (the labels were split out into y),
  # so the statistics of every column are kept
  summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
  return summaries

# compute the per-feature (mean, stdev, count) statistics separately for each class
def mean_std_per_class(x, y):
  separated = separate_by_class(x, y)
  summaries = dict()
  for class_value, rows in separated.items():
    summaries[class_value] = mean_std_for_all(rows)
  return summaries

summary = mean_std_per_class(np.array(x_train), np.array(y_train))
for label in summary:
  print(label)
  for row in summary[label]:
    print(row)
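The resulting summary is a dict keyed by the encoded class label (0, 1, 2), and each value is a list of (mean, stdev, count) tuples, one per feature. Schematically (the names below are placeholders, not actual output):

# {0: [(mean_SepalLength, std_SepalLength, n0), (mean_SepalWidth, std_SepalWidth, n0),
#      (mean_PetalLength, std_PetalLength, n0), (mean_PetalWidth, std_PetalWidth, n0)],
#  1: [...],
#  2: [...]}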

Now we will compute the class probabilities for the test set.

# use the Gaussian PDF to derive the per-feature likelihoods
from math import sqrt
from math import pi
from math import exp

# compute the Gaussian probability density function at x
def calculate_probability(x, mean, stdev):
	exponent = exp(-((x - mean)**2 / (2 * stdev**2)))
	return (1 / (sqrt(2 * pi) * stdev)) * exponent

# compute the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries, row):
  total_rows = sum([summaries[label][0][2] for label in summaries]) # summaries[label][0][2]: class count
  probabilities = dict()
  for class_value, class_summaries in summaries.items():
    probabilities[class_value] = summaries[class_value][0][2] / float(total_rows) # prior P(y = c)
    for i in range(len(class_summaries)):
      mu, sigma, _count = class_summaries[i]
      probabilities[class_value] *= calculate_probability(row[i], mu, sigma) # likelihood of feature i
  return probabilities

# the predicted label is the class with the highest probability
def predict_label(probs):
  return max(probs, key=probs.get)

arr1 = np.array(x_test)
arr2 = np.array(y_test)

# stack x_test and y_test into a single array (the label is the last column)
test_set = np.column_stack((arr1, arr2))

# reuse the per-class statistics estimated from the train set above
summaries = summary
predict = list() # list to store the final prediction labels

# compute the class probabilities & report the final prediction label for each testing sample
for i in range(len(test_set)):
  probabilities = calculate_class_probabilities(summaries, test_set[i])
  print('testing sample', i)
  print('class probabilities:', probabilities)
  print('final prediction label:', predict_label(probabilities))
  predict.append(predict_label(probabilities))
  print()

# compute testing accuracy
def accuracy_metric(target, predicted):
	correct = 0
	for i in range(len(target)):
		if target[i] == predicted[i]:
			correct += 1
	return correct / float(len(target)) * 100.0

final_accuracy = accuracy_metric(arr2, predict)
print('Testing accuracy: ', final_accuracy)

As printed above, the testing accuracy came out to around 89%.
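For comparison, scikit-learn's GaussianNB implements the same model; fitting it on the same split should give an accuracy in the same ballpark (it uses a maximum-likelihood variance estimate rather than the sample standard deviation above, so the numbers can differ slightly):

from sklearn.naive_bayes import GaussianNB

# fit sklearn's Gaussian Naive Bayes on the identical train/test split
gnb = GaussianNB()
gnb.fit(x_train, y_train)
print('sklearn GaussianNB accuracy:', gnb.score(x_test, y_test))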

 

Now let's check the class prediction probabilities once more, this time for a single new sample.

test = [5.7, 2.9, 4.2, 1.3]  # SepalLength, SepalWidth, PetalLength, PetalWidth

probabilities = calculate_class_probabilities(summaries, test)
print('class prediction probabilities: ', probabilities)
print('final prediction label:', predict_label(probabilities))

The class probabilities and the final prediction label are printed as above. These measurements are typical of Iris-versicolor, so we expect the model to predict label 1.
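One implementation note: calculate_class_probabilities multiplies many small densities together, which can underflow for data with more features. A standard variant sums log-densities instead; here is a minimal sketch (my addition, not code from the assignment):

from math import log

# log-space version: sum log densities instead of multiplying densities
def calculate_class_log_probabilities(summaries, row):
  total_rows = sum(summaries[label][0][2] for label in summaries)
  log_probs = dict()
  for class_value, class_summaries in summaries.items():
    log_probs[class_value] = log(class_summaries[0][2] / float(total_rows))  # log prior
    for i in range(len(class_summaries)):
      mu, sigma, _count = class_summaries[i]
      log_probs[class_value] += log(calculate_probability(row[i], mu, sigma))
  return log_probs

# predict_label works unchanged, since taking logs preserves the argmax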
