
Understanding Machine Learning and Using Libraries (2): Multiple Linear Regression

ny:D 2024. 6. 4. 16:18

240604 Today I Learn

Multiple Linear Regression

💡 Multiple Linear Regression
A regression analysis with two or more explanatory (independent) variables.
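Written out, the definition looks like the equation below. This is the standard textbook form, sketched with the same symbol names the code in this post uses (w0 for the intercept, w1 and w2 for the coefficients); the number of variables p and the error term ε are additions for completeness.

```latex
y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p + \varepsilon, \qquad p \ge 2
```

With p = 1 the same expression reduces to simple linear regression.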

Multiple Linear Regression Practice

tips

🙋‍♀️ Meosin trained a simple linear regression on the data but found that it did not perform very well, so Meosin wanted to bring in other data, such as sex.

1. The sex data is stored as strings, so it has to be represented as numbers.

The tips DataFrame with the sex_enc column added

# 1. Encode sex as male (1) / female (0)
def encode_gender(value):
    # apply() passes one cell value at a time, not a whole Series
    if value == 'Female':
        return 0
    else:
        return 1

# Use apply to run encode_gender on every row of the sex column,
# and store the result in a new column named 'sex_enc'.
tips['sex_enc'] = tips['sex'].apply(encode_gender)
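
As a side note, the same encoding can also be written with a dictionary and map instead of a hand-written function. The snippet below is only a sketch on a tiny made-up DataFrame (toy_tips and its values are hypothetical, not the real tips data), and it assumes the column holds plain strings.

```python
import pandas as pd

# Hypothetical toy data standing in for the tips DataFrame
toy_tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 24.59],
    'sex': ['Female', 'Male', 'Male', 'Female'],
})

# Explicit mapping: unlike the if/else version, a label missing from
# the dictionary becomes NaN instead of silently being encoded as 1
sex_mapping = {'Female': 0, 'Male': 1}
toy_tips['sex_enc'] = toy_tips['sex'].map(sex_mapping)

print(toy_tips)
```

The explicit dictionary makes typos or unexpected categories easier to notice, since they show up as missing values.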

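2. Fitting the model / getting the regression equation

from sklearn.linear_model import LinearRegression

# Create the (still unfitted) model object
model_lr3 = LinearRegression()

# Fit the model first: coef_ and intercept_ only exist after fit()
# x1 = total_bill, x2 = sex_enc, y = tip
model_lr3.fit(X=tips[['total_bill','sex_enc']], y=tips[['tip']])

# Read off the regression equation
# y was passed as a 2-D DataFrame, so coef_ is 2-D and intercept_ is 1-D
w1 = model_lr3.coef_[0][0]
w2 = model_lr3.coef_[0][1]
w0 = model_lr3.intercept_[0]

print(f'y = {w1:.2f}x1 + {w2:.2f}x2 + {w0:.2f}')  # y = 0.11x1 + -0.03x2 + 0.93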
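
One detail worth spelling out is why the coefficients are read with coef_[0][0] rather than coef_[0]. The sketch below uses synthetic, noise-free data (every number in it is made up for illustration, not taken from tips) to show that the shape of coef_ depends on whether y is passed as a 2-D or a 1-D array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for total_bill (x1) and sex_enc (x2)
rng = np.random.default_rng(0)
x1 = rng.uniform(5, 50, size=100)
x2 = rng.integers(0, 2, size=100)
X = np.column_stack([x1, x2])

# Noise-free target with known weights, so the fit should recover them
y = 0.10 * x1 - 0.03 * x2 + 0.90

# y as a 2-D column (like tips[['tip']]): coef_ has shape (1, 2)
model_2d = LinearRegression().fit(X, y.reshape(-1, 1))
print(model_2d.coef_.shape, model_2d.intercept_.shape)

# y as a 1-D array (like tips['tip']): coef_ has shape (2,)
model_1d = LinearRegression().fit(X, y)
print(model_1d.coef_.shape)

w1, w2 = model_1d.coef_
w0 = model_1d.intercept_
print(f'y = {w1:.2f}*x1 + {w2:.2f}*x2 + {w0:.2f}')
```

Passing y as tips['tip'] instead of tips[['tip']] would therefore allow the shorter coef_[0] and coef_[1] indexing.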

3. Simple linear regression model (model_lr2) vs. multiple linear regression model (model_lr3)

df

  • With the multiple linear regression, the MSE got slightly smaller and R-squared got slightly larger, but the change is small. (Neither one is a particularly good model.)
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

# First model (x: total_bill, y: tip)
# model_lr2 is the simple linear regression model fitted on total_bill
y_pred_tip = model_lr2.predict(tips[['total_bill']])
mse_model1 = mean_squared_error(tips['tip'], y_pred_tip)
r2_model1 = r2_score(tips['tip'], y_pred_tip)

# Second model (x1: total_bill, x2: sex_enc, y: tip)
y_pred_tip2 = model_lr3.predict(tips[['total_bill','sex_enc']])
mse_model2 = mean_squared_error(tips[['tip']], y_pred_tip2)
r2_model2 = r2_score(tips[['tip']], y_pred_tip2)

# Compare the two models side by side
df = pd.DataFrame({'model' : ['simple linear','multi linear'],
                   'mse' : [mse_model1, mse_model2],
                   'r_squared': [r2_model1, r2_model2]})
df
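
To make the two metrics less of a black box, here is a rough sketch that computes MSE and R-squared by hand with NumPy and compares a one-variable fit against a two-variable fit. The data is synthetic and the helper names (mse_and_r2, fit_predict) are hypothetical; the second variable is deliberately unrelated to y, to mimic a predictor that adds very little.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return (MSE, R-squared) computed from their definitions."""
    residual = y_true - y_pred
    ss_res = np.sum(residual ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return np.mean(residual ** 2), 1.0 - ss_res / ss_tot

def fit_predict(X, y):
    """Ordinary least squares with an intercept; returns in-sample predictions."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ w

rng = np.random.default_rng(42)
x1 = rng.uniform(5, 50, size=200)
x2 = rng.integers(0, 2, size=200).astype(float)    # has no real effect on y
y = 0.1 * x1 + 0.9 + rng.normal(0, 1, size=200)

mse_simple, r2_simple = mse_and_r2(y, fit_predict(x1.reshape(-1, 1), y))
mse_multi, r2_multi = mse_and_r2(y, fit_predict(np.column_stack([x1, x2]), y))

print(f'simple: mse={mse_simple:.4f}, r2={r2_simple:.4f}')
print(f'multi : mse={mse_multi:.4f}, r2={r2_multi:.4f}')
```

Because both metrics are evaluated on the training data, adding a variable to a least-squares fit cannot make the MSE larger or R-squared smaller, so a tiny improvement by itself says little about whether the extra variable is useful.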

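  • When the two models are actually plotted, there is almost no visible difference between them.
import matplotlib.pyplot as plt
import seaborn as sns

# Store each model's predictions as columns so they can be plotted
# (harmless if these columns already exist)
tips['pred2'] = y_pred_tip.ravel()
tips['pred3'] = y_pred_tip2.ravel()

sns.scatterplot(data=tips, x='total_bill', y = 'tip', alpha = 0.7, color = 'pink')
plt.title('Tip Distribution by Total Bill')

# Add the regression line for model_lr2
sns.lineplot(data=tips, x = 'total_bill', y = 'pred2', color = 'green', alpha = 0.8)

# Add the regression line for model_lr3
sns.lineplot(data=tips, x = 'total_bill', y = 'pred3', color = 'red', alpha = 0.7)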

Linear Regression Summary

Assumptions of linear regression

  • Linearity: There must be a linear relationship between the dependent variable (Y) and the independent variables (X).
  • Homoscedasticity: The variance of the errors must be constant across all levels of the independent variables. In other words, the errors must not show any particular pattern, and their spread must stay the same regardless of the values of the independent variables.
  • Normality: The error terms should follow a normal distribution.
  • Independence: The X variables should be independent of one another.
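
The homoscedasticity assumption in particular can be checked roughly from the residuals. The following is a minimal sketch on synthetic data in which the noise is deliberately made to grow with x (all values are made up for illustration); it simply compares the residual variance in the lower and upper halves of x.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(5, 50, size=500)

# Heteroscedastic by construction: the noise standard deviation grows with x
y = 0.1 * x + 0.9 + rng.normal(0, 0.05 * x)

# Fit a straight line and compute the residuals
slope, intercept = np.polyfit(x, y, deg=1)
residual = y - (slope * x + intercept)

# Under homoscedasticity these two variances would be roughly equal
low = residual[x < np.median(x)]
high = residual[x >= np.median(x)]
print(f'residual variance (low x):  {low.var():.3f}')
print(f'residual variance (high x): {high.var():.3f}')
```

A scatter plot of residuals against x or against the predictions gives the same information visually; a funnel shape suggests the assumption is violated.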

🚨 The multicollinearity problem
As the number of variables grows, they are often related to one another. In regression analysis, a strong correlation among the independent variables (X) is called the multicollinearity problem. For example, if Weight and Height are used to predict some other Y (say, foot size), Weight and Height are correlated with each other, so multicollinearity arises.

To deal with this problem, one of the following two approaches can be chosen.
(1) Keep only one of the highly correlated variables.
(2) Run a dimensionality reduction that describes both variables at once (Principal Component Analysis, PCA) and reduce them to a single variable.
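
Option (2) might look roughly like the sketch below. It uses synthetic height and weight values (made up for illustration, not real measurements), checks their correlation, and then compresses the two standardized columns into a single component with scikit-learn's PCA. The name body_size for the combined variable is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: weight is constructed to depend strongly on height
rng = np.random.default_rng(7)
height = rng.normal(170, 10, size=300)                      # cm
weight = 0.9 * height - 90 + rng.normal(0, 5, size=300)     # kg

# Step 1: confirm that the two predictors are strongly correlated
corr = np.corrcoef(height, weight)[0, 1]
print(f'correlation: {corr:.3f}')

# Step 2: standardize, then keep only the first principal component
X = np.column_stack([height, weight])
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=1)
body_size = pca.fit_transform(X_scaled)    # shape (300, 1)

explained = pca.explained_variance_ratio_[0]
print(f'variance explained by one component: {explained:.3f}')
```

The single body_size column could then replace height and weight as the regression input, at the cost of being harder to interpret than either original variable.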

Linear regression summary

Pros
  • Intuitive and easy to understand; the X-Y relationship can be quantified.
  • The model trains quickly (the weights are fast to compute).

Cons
  • Requires assuming a linear relationship between X and Y.
  • Sensitive to outliers, because the evaluation metrics are based on a mean.
  • Information can be lost when categorical variables are encoded.
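
The sensitivity to outliers is easy to see on a toy example. In this sketch (all numbers are made up), ten points lie exactly on a line, and then a single observation is corrupted before refitting.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)
y = 2.0 * x + 1.0                  # points exactly on the line y = 2x + 1

# Fit on clean data
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Corrupt a single observation and fit again
y_outlier = y.copy()
y_outlier[-1] += 30.0
slope_out, intercept_out = np.polyfit(x, y_outlier, deg=1)

print(f'clean fit:        slope={slope_clean:.2f}, intercept={intercept_clean:.2f}')
print(f'with one outlier: slope={slope_out:.2f}, intercept={intercept_out:.2f}')
```

Because least squares minimizes the mean of squared errors, one large error is squared and can pull the whole line toward it.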