
Understanding Machine Learning and Using Its Libraries (1): Machine Learning Basics / Linear Regression

ny:D 2024. 6. 3. 23:40

240603 Today I Learn

What is machine learning?

💡 Machine Learning (ML)
Algorithms for making decisions based on observed patterns. In the past, decisions relied on information aggregated through descriptive statistics and similar tools. Thanks to advances in data collection and processing, machine learning instead recognizes patterns in large volumes of data and uses them to predict or classify.

AI ⊃ Machine Learning ⊃ Deep Learning

Types of machine learning

  1. Supervised learning: the model is trained with both the questions and the answers
    • Prediction (regression)
    • Classification
  2. Unsupervised learning: the model is trained without being given the answers
    • Association rules
    • Clustering
  3. Reinforcement learning: behavior is reinforced through rewards, maximizing reward and minimizing penalty
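
The practical difference between the first two categories shows up in what gets passed to `fit`. Here is a minimal sketch with scikit-learn. The toy arrays and the choice of `KMeans` as the unsupervised example are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Toy data, invented for illustration only.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([2.0, 4.0, 6.0, 20.0, 22.0, 24.0])  # the "answers" for supervised learning

# Supervised: the model sees both the inputs X and the answers y.
reg = LinearRegression().fit(X, y)
pred = reg.predict(np.array([[5.0]]))

# Unsupervised: the model sees only X and groups similar rows on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

print(pred)    # predicted y for x = 5
print(labels)  # cluster index assigned to each row of X
```

The cluster indices themselves are arbitrary; what matters is that the three small values share one label and the three large values share the other.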

Linear regression theory

💡 Linear Regression
A regression technique that models the linear relationship between a dependent variable y and one or more independent (explanatory) variables X.
→ Once the regression coefficients (weights) are known, the value of y can be predicted for a given X.
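
For a single explanatory variable, the definition above can be written out as follows. The symbols $w_0$ (intercept) and $w_1$ (slope) follow the names used in the practice code in this post; the error term $\varepsilon$ and the closed-form least-squares estimates are the standard textbook formulation, added here as an assumption about what the notes intend.

```latex
y = w_0 + w_1 x + \varepsilon

\hat{w}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}
```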

Evaluation metrics

💡 MSE (Mean Squared Error)
The sum of squared errors (SSE) divided by the number of data points.

💡 R-squared
How much of the data the regression line can explain: the proportion of the total variation in y that is accounted for by the regression model (SSreg).
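
Written as formulas, the two metrics look like this. The notation ($\hat{y}_i$ for a predicted value, $\bar{y}$ for the mean of y, $SS_{tot}$ for the total variation) is assumed here, since the notes name only SSE and SSreg. The last equality holds for a least-squares fit that includes an intercept.

```latex
\mathrm{MSE} = \frac{\mathrm{SSE}}{n} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

R^2 = \frac{SS_{reg}}{SS_{tot}}
    = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
    = 1 - \frac{\mathrm{SSE}}{SS_{tot}}
```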

 

Linear regression practice

Libraries / functions used

# Libraries used
import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Data used: body_df
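
The contents of body_df are not shown in these notes. To run the snippets that follow, a stand-in with the same two columns is enough. The values below are placeholders invented for illustration, so they will not reproduce the numbers printed in this post.

```python
import pandas as pd

# Placeholder values invented for illustration; NOT the data used in this post.
body_df = pd.DataFrame({
    'weight': [55, 60, 62, 68, 70, 74, 78, 82, 85, 90],            # kg
    'height': [158, 163, 165, 170, 171, 175, 178, 181, 183, 188],  # cm
})

print(body_df.shape)  # (10, 2)
print(body_df.head())
```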

Fitting the linear regression equation

# Create a linear regression model as model_lr
from sklearn.linear_model import LinearRegression
model_lr = LinearRegression()

# Train on the data
model_lr.fit(X = body_df[['weight']], # if you pass the keyword 'X=', it must be uppercase!
             y = body_df[['height']])

# Weight (w1); coef_ is 2-D because y was passed as a DataFrame
w1 = model_lr.coef_[0][0]
w1 = w1.round(2) # round to 2 decimal places

# Bias (w0)
w0 = model_lr.intercept_[0]
w0 = w0.round(2) # round to 2 decimal places

# Regression equation
print(f'y = {w1}x + {w0}')
## y = 0.86x + 109.37

Method 1 | Extract w0 and w1 and compute the predictions from the regression equation

body_df['y_pred1'] = body_df['weight']*w1 + w0

Method 2 | Use predict

y_pred2 = model_lr.predict(body_df[['weight']])

## array([[184.40385835],
##        [179.22878362],
##        [180.09129608],
##        [188.71642061],
##        [186.99139571],
##        [161.97853455],
##        [183.54134589],
##        [166.29109682],
##        [168.87863418],
##        [168.87863418]])
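
The two methods agree exactly only when Method 1 uses the unrounded coefficients. With w0 and w1 rounded to two decimals, as in this post, the predictions drift slightly. A minimal sketch on invented toy data (the DataFrame `df` and its values are hypothetical, not the post's body_df):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data invented for illustration (not the post's body_df).
df = pd.DataFrame({'weight': [55, 60, 68, 74, 82, 90],
                   'height': [158, 163, 170, 175, 181, 188]})

model = LinearRegression().fit(df[['weight']], df[['height']])
w1 = model.coef_[0][0]    # slope, unrounded
w0 = model.intercept_[0]  # intercept, unrounded

manual_exact = df['weight'] * w1 + w0                        # Method 1, full precision
manual_rounded = df['weight'] * round(w1, 2) + round(w0, 2)  # Method 1, rounded coefficients
via_predict = model.predict(df[['weight']]).ravel()          # Method 2

print(np.allclose(manual_exact, via_predict))      # True
print(np.abs(manual_rounded - via_predict).max())  # small, but generally not zero
```

Rounding is convenient for printing the equation, but keeping the full-precision coefficients (or calling `predict`) avoids carrying that error into later metrics.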

Evaluating the model

  • Regression: MSE (lower is better)
  • R-squared: the closer to 1, the better the fit

Method 1 | Compute the MSE by hand

# Create the prediction column (pred)
body_df['pred'] = body_df['weight']*w1 + w0

# Difference between the actual values (body_df['height']) and the predictions (body_df['pred'])
body_df['error'] = body_df['height'] - body_df['pred']

# Square the errors
body_df['squared_error'] = body_df['error']*body_df['error']

# Compute the MSE (about 10)
body_mse = body_df['squared_error'].sum()/len(body_df)

body_mse
## 10.190499999999975

 

Method 2 | Use the sklearn package

# Import the metric functions
from sklearn.metrics import mean_squared_error, r2_score

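# Metric functions all take the true values first, then the predictions
y_true = body_df['height']
# Use the model's unrounded predictions. body_df['pred'] was built from the
# rounded w0 and w1, so it would reproduce the hand-computed 10.19 instead
# of the values shown below.
y_pred = model_lr.predict(body_df[['weight']])

mean_squared_error(y_true, y_pred) ## 10.152939045376318
r2_score(y_true, y_pred) ## 0.8899887415172141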
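
When both approaches are given the same predictions, the hand calculation and sklearn agree to floating-point precision. A gap such as 10.19 versus 10.15 therefore points to different prediction inputs rather than a different formula. A small check on invented values (the arrays below are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values invented for illustration.
y_true = np.array([158.0, 163.0, 170.0, 175.0, 181.0, 188.0])
y_pred = np.array([157.0, 164.5, 169.0, 176.0, 182.0, 187.0])

# MSE from its definition: the mean of the squared errors.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R-squared as r2_score defines it: 1 - SSE / SST.
sse = np.sum((y_true - y_pred) ** 2)
sst = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - sse / sst

print(np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))  # True
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))             # True
```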

Drawing the regression line on the plot

# Draw a scatter plot of weight vs. height
sns.scatterplot(data = body_df, x = 'weight', y = 'height')

# Add the regression line (lineplot) on top of the scatter plot
sns.lineplot(data = body_df, x = 'weight', y = 'pred', color = 'red')

plt.title('Weight vs Height')
plt.xlabel('Weight (kg)')
plt.ylabel('Height (cm)')
plt.show() # call show() last so the points and the line share one figure

Linear regression practice - Tips

tips

# Dataset used
tips = sns.load_dataset('tips')

 

🙋 Problem
Meosin, who works part-time at a restaurant, decided to apply linear regression to the tip data this time. Wanting to earn more, Meosin plans to run a regression that predicts the tip (Y) from the total bill (X).

# Train (fit) the linear regression
from sklearn.linear_model import LinearRegression
model_lr = LinearRegression()

# Regress the tip (Y) on the total bill (X)
model_lr.fit(X = tips[['total_bill']], 
             y = tips[['tip']])

# Get the regression equation
w1_tips = model_lr.coef_[0][0]
w0_tips = model_lr.intercept_[0]

print(f'y = {w1_tips:.2f}x + {w0_tips:.2f}') ## y = 0.11x + 0.92

# Evaluate the model
from sklearn.metrics import mean_squared_error, r2_score

y_pred_tip = model_lr.predict(tips[['total_bill']]) # predictions
mse_tip = mean_squared_error(tips['tip'], y_pred_tip) # mse : 1.036019442011377
r_squared_tip = r2_score(tips['tip'], y_pred_tip) # r-squared : 0.45661658635167657
  • MSE is expressed in squared units of the target, so it is only meaningful for comparing different models on the same dataset
  • R-squared is about 0.46, so total_bill alone explains less than half of the variation in tip; this is not a good model (the closer to 1, the better)
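
One way to read the two notes above together is to compare the fitted model against a baseline that always predicts the mean of y. R-squared is then the fraction by which the model reduces the baseline's MSE. A sketch on synthetic data (the generated "bill" and "tip" values are invented for illustration and are not the seaborn tips dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data invented for illustration (not the seaborn tips dataset).
rng = np.random.default_rng(0)
X = rng.uniform(5, 50, size=(200, 1))                 # hypothetical bill amounts
y = 1.0 + 0.1 * X[:, 0] + rng.normal(0, 1, size=200)  # hypothetical tips with noise

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)
mse_model = mean_squared_error(y, y_hat)

# Baseline: always predict the mean of y.
baseline_pred = np.full_like(y, y.mean())
mse_baseline = mean_squared_error(y, baseline_pred)

rmse_model = np.sqrt(mse_model)  # back in the same units as y

print(mse_model < mse_baseline)                                      # True
print(np.isclose(1 - mse_model / mse_baseline, r2_score(y, y_hat)))  # True
```

Taking the square root (RMSE) puts the error back in the units of the target, which is often easier to interpret than the raw MSE.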

# Draw the scatter plot
sns.scatterplot(data=tips, x='total_bill', y = 'tip', alpha = 0.7, color = 'pink')
plt.title('Tip Distribution by Total Bill')

# Add the regression line
tips['pred'] = y_pred_tip.ravel() # flatten the (n, 1) prediction array into a column
sns.lineplot(data=tips, x = 'total_bill', y = 'pred', color = 'green')
plt.show()