
AI Day 7 (2023-05-16) Artificial Intelligence Basics _ Machine Learning - Voting

by prometedor 2023. 5. 16.

Voting

ㄴ In general, voting combines classifiers that use different algorithms.

(Note: in bagging, by contrast, every individual classifier is based on the same type of algorithm.)

Hard voting: the final class is decided by a simple majority vote over each classifier's predicted label.
Soft voting: the classifiers' predicted probabilities are averaged, and the class with the highest average probability is chosen.

 

Hard voting

https://tyami.github.io/assets/images/post/ML/2020-10-06-ensemble/2020-10-06-ensemble-hard-voting.png

Hard voting takes a majority vote over the labels predicted by the individual weak learners, as sketched below.
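
A minimal NumPy sketch of the idea, using made-up label predictions from three classifiers (hypothetical values, for illustration only):

import numpy as np

# Toy label predictions: one row per classifier, one column per sample.
preds = np.array([
    [0, 1, 1],   # classifier 1
    [0, 1, 0],   # classifier 2
    [1, 1, 0],   # classifier 3
])

# Majority vote per sample: count label occurrences in each column, take the mode.
hard_vote = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
print(hard_vote)  # [0 1 0]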

 

Soft voting

https://tyami.github.io/assets/images/post/ML/2020-10-06-ensemble/2020-10-06-ensemble-soft-voting-average.png

 

ㄴ Soft voting uses the average of the weak learners' predicted probability values.

 

https://tyami.github.io/assets/images/post/ML/2020-10-06-ensemble/2020-10-06-ensemble-soft-voting-weighted-sum.png

 

When the weak learners are not equally trustworthy, weights can be assigned so that a weighted sum of the probabilities is used instead of a plain average, as sketched below.
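
A minimal NumPy sketch of both variants, using made-up class probabilities from three classifiers for a single sample (hypothetical values):

import numpy as np

# Toy per-class probabilities: P(class 0), P(class 1).
probas = np.array([
    [0.7, 0.3],  # classifier 1
    [0.4, 0.6],  # classifier 2
    [0.5, 0.5],  # classifier 3
])

# Plain soft voting: average the probabilities, then take the argmax.
print(probas.mean(axis=0).argmax())  # [0.533, 0.467] -> class 0

# Weighted soft voting: trust classifier 2 three times as much.
weights = [1, 3, 1]
print(np.average(probas, axis=0, weights=weights).argmax())  # [0.48, 0.52] -> class 1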

 

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier

lr = LogisticRegression()
knn = KNeighborsClassifier(n_neighbors=8)
rfc = RandomForestClassifier()
xgb = XGBClassifier()

model = VotingClassifier(
    estimators=[('LR', lr), ('KNN', knn), ('RFC', rfc), ('XGB', xgb)],
    voting='soft',
    n_jobs=-1
)
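
scikit-learn's VotingClassifier also accepts a weights parameter, which implements the weighted soft voting described above; a minimal sketch with hypothetical weights:

model = VotingClassifier(
    estimators=[('LR', lr), ('KNN', knn), ('RFC', rfc), ('XGB', xgb)],
    voting='soft',
    weights=[1, 1, 1, 2],  # hypothetical: trust XGB twice as much
    n_jobs=-1
)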

 

 

Classification model

ml17_voting_iris.py

import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
import time

# 1. Data
datasets = load_iris()
x = datasets.data
y = datasets.target

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, shuffle=True, random_state=42
)

# scaler
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)


# 2. Model
xgb = XGBClassifier(colsample_bylevel=0, colsample_bynode=0,
                    colsample_bytree=0, gamma=4, learning_rate=0.01,
                    max_depth=3, min_child_weight=0, n_estimators=100,
                    reg_alpha=0, reg_lambda=1, subsample=0.2)
                    # plug in the best parameters found in ml15_gridSearchCV_xgb_iris
lgbm = LGBMClassifier()
cat = CatBoostClassifier()

model = VotingClassifier(
    estimators=[('xgb', xgb), ('lgbm', lgbm), ('cat', cat)],
    voting='soft',
    n_jobs=-1
)

# 3. Training
start_time = time.time()
model.fit(x_train, y_train)
end_time = time.time() - start_time


# 4. Evaluation, prediction
# y_predict = model.predict(x_test)
# score = accuracy_score(y_test, y_predict)
# print('voting result : ', score)

classifiers = [cat, xgb, lgbm]
for clf in classifiers:
    clf.fit(x_train, y_train)
    y_predict = clf.predict(x_test)
    score = accuracy_score(y_test, y_predict)
    class_name = clf.__class__.__name__
    print('{0} accuracy: {1: .4f}'.format(class_name, score))

# hard and soft voting give the same result here
# CatBoostClassifier accuracy:  1.0000
# XGBClassifier accuracy:  1.0000
# LGBMClassifier accuracy:  1.0000
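
To report the ensemble's own score as well, the commented-out block in step 4 can be re-enabled; the VotingClassifier was already fitted in step 3 and is still bound to model:

y_predict = model.predict(x_test)
score = accuracy_score(y_test, y_predict)
print('voting result : ', score)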

 

 

Regression model

ml17_voting_california.py

import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import VotingRegressor
from sklearn.metrics import r2_score
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
import time

# 1. Data
datasets = fetch_california_housing()
x = datasets.data
y = datasets.target

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, shuffle=True, random_state=42
)

# scaler
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)


# [Practice] improve performance
# 2. Model
xgb = XGBRegressor()
# tried the best parameters found in ml15_gridSearchCV_xgb_california, but the defaults performed better, so they are left as-is
lgbm = LGBMRegressor()
cat = CatBoostRegressor(depth=9, l2_leaf_reg=5, learning_rate=0.1)  # best parameters found and plugged in

model = VotingRegressor(
    estimators=[('xgb', xgb), ('lgbm', lgbm), ('cat', cat)],
    # no voting parameter needed: VotingRegressor always averages the predictions
    n_jobs=-1
)


# 3. Training
start_time = time.time()
model.fit(x_train, y_train)
end_time = time.time() - start_time


# 4. Evaluation, prediction
# y_predict = model.predict(x_test)
# score = r2_score(y_test, y_predict)    # r2_score, not accuracy_score, for regression
# print('voting result : ', score)

regressors = [cat, xgb, lgbm]
for reg in regressors:
    reg.fit(x_train, y_train)
    y_predict = reg.predict(x_test)
    score = r2_score(y_test, y_predict)
    class_name = reg.__class__.__name__
    print('{0} R2 score: {1: .4f}'.format(class_name, score))


# default
# CatBoostRegressor R2 score:  0.8492
# XGBRegressor R2 score:  0.8287 (defaults are better)
# LGBMRegressor R2 score:  0.8365

# catboost
# depth=9, l2_leaf_reg=5, learning_rate=0.1
# CatBoostRegressor R2 score:  0.8543 (best)
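
VotingRegressor also accepts a weights parameter for a weighted average of the individual predictions; a minimal sketch with hypothetical weights favoring CatBoost, the strongest single model above:

model = VotingRegressor(
    estimators=[('xgb', xgb), ('lgbm', lgbm), ('cat', cat)],
    weights=[1, 1, 2],  # hypothetical weights, for illustration
    n_jobs=-1
)
model.fit(x_train, y_train)
print('weighted voting R2 : ', r2_score(y_test, model.predict(x_test)))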