
Advanced Project: Predicting Walmart Weekly Sales with Regression Analysis (2)

ny:D 2024. 6. 26. 18:37


 

✅ Should negative Weekly Sales values be treated as outliers (data-entry errors)?

No. Sales figures can legitimately be negative because of refunds, damaged goods, and similar causes, so negative values cannot be treated as outliers.
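Before deciding to keep negative values, it helps to see how many rows are affected. The snippet below is a minimal sketch on a made-up `toy_train` frame (the values are hypothetical, not taken from the Walmart data); only the `Weekly_Sales` column name comes from the post.

```python
import pandas as pd

# Hypothetical stand-in for the training table; the numbers are invented.
toy_train = pd.DataFrame(
    {"Weekly_Sales": [24924.5, -12.0, 1500.0, -863.0, 310.25]}
)

# Rows where returns or write-offs exceeded the week's sales.
negative = toy_train[toy_train["Weekly_Sales"] < 0]

# Share of rows with a negative weekly total.
negative_share = len(negative) / len(toy_train)
print(len(negative), negative_share)
```

If the share is small and the values are plausible refund amounts, keeping the rows is consistent with the reasoning above.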

 

✔️ Initial variable selection

1. Should the MarkDown1-5 columns be used?

We decided not to use the MarkDown1-5 columns, for the following reasons.

  • The share of missing values is too high. → 64% of all values are missing.
  • There is not enough information about the columns. → It is unclear whether each value is the total sales for the item or the weekly sales total for that date.
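The missing-value share mentioned above can be checked per column with `isna().mean()`. This is a small sketch on an invented `toy` frame, so the resulting ratios are illustrative and are not the 64% reported in the post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the features table; the values are invented.
toy = pd.DataFrame({
    "MarkDown1": [np.nan, 100.0, np.nan, np.nan],
    "MarkDown2": [np.nan, np.nan, 50.0, np.nan],
    "CPI": [211.1, 211.2, 211.3, 211.4],
})

markdown_cols = [c for c in toy.columns if c.startswith("MarkDown")]

# isna() flags missing cells; the mean of those booleans is the missing share.
missing_ratio = toy[markdown_cols].isna().mean()
print(missing_ratio)
```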

2. Should both Type and Size be used?

The Pearson correlation coefficient between Type and Size is -0.81, which we judged could cause multicollinearity, so we decided to keep only one of the two variables.
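A correlation like this can be computed once Type is turned into integer codes. The sketch below assumes Type is encoded as A=0, B=1, C=2 (which is what a label encoder produces for these sorted string labels) and uses an invented `toy_stores` frame, so the coefficient it prints is not the -0.81 from the real data.

```python
import pandas as pd

# Hypothetical store table; sizes are invented for illustration.
toy_stores = pd.DataFrame({
    "Type": ["A", "A", "B", "B", "C", "C"],
    "Size": [200000, 180000, 120000, 100000, 40000, 35000],
})

# Integer codes in sorted label order: A=0, B=1, C=2.
toy_stores["Type_le"] = toy_stores["Type"].astype("category").cat.codes

# Series.corr uses the Pearson coefficient by default.
r = toy_stores["Size"].corr(toy_stores["Type_le"])
print(round(r, 2))
```

A strongly negative value means larger codes (B, C) go with smaller stores, which is why the two columns carry largely overlapping information.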

  • Size is the variable with the stronger influence on Weekly_Sales.

df[['Size', 'Type_le']].corrwith(df['Weekly_Sales'])
  • Plotting Size by Type shows stores whose Size is far too small for their Type. The outliers in A and B have a mean Size that is closer to that of C.

import matplotlib.pyplot as plt
import seaborn as sns

# Sort the stores DataFrame by Size
stores.sort_values(by='Size', ascending=False).reset_index()

# Check the distribution with a plot
sns.histplot(data=stores, x='Size', hue='Type', bins=10)
plt.axvline(x=75000, linestyle='--', color='r')
plt.axvline(x=145000, linestyle='--', color='r')

A boxplot showed outliers in both Type A and Type B. We split these outliers by Type into two groups (out_a, out_b) and compared their means: they turned out to be closer to the Size of Type C than to the Sizes of A and B. We therefore decided to use Size instead of Type.
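The post does not show how out_a and out_b were built. One plausible reading, sketched here, is to flag boxplot-style outliers per Type with the 1.5×IQR rule and compare their mean Size with the per-Type means. The `iqr_outliers` helper and the `toy_stores` values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical store table; each of A and B has one unusually small store.
toy_stores = pd.DataFrame({
    "Type": ["A"] * 6 + ["B"] * 6 + ["C"] * 3,
    "Size": [200000, 205000, 210000, 195000, 202000, 40000,
             120000, 125000, 118000, 122000, 121000, 35000,
             40000, 42000, 39000],
})

def iqr_outliers(group):
    """Return rows whose Size lies outside the 1.5*IQR whiskers of the group."""
    q1 = group["Size"].quantile(0.25)
    q3 = group["Size"].quantile(0.75)
    iqr = q3 - q1
    mask = (group["Size"] < q1 - 1.5 * iqr) | (group["Size"] > q3 + 1.5 * iqr)
    return group[mask]

out_a = iqr_outliers(toy_stores[toy_stores["Type"] == "A"])
out_b = iqr_outliers(toy_stores[toy_stores["Type"] == "B"])

# Mean Size per Type, to compare against the outlier groups.
type_means = toy_stores.groupby("Type")["Size"].mean()
print(out_a["Size"].mean(), out_b["Size"].mean())
print(type_means)
```

In this toy data the outlier stores sit next to the Type C mean, mirroring the pattern the post describes: the Type label alone does not separate small stores cleanly, while Size does.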

🧨 ML model design

Base table

df

import pandas as pd

# Remove Type and MarkDown1-5
df = train.copy()
mask = ['Type','MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']

df = df.drop(columns=mask)

# Process the date column
df["Date"] = pd.to_datetime(df["Date"])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
# ISO week number of each date
df['Week'] = df['Date'].apply(lambda x: x.isocalendar()[1])

 

Encoding

df

๋ชจ๋ธ๋ง์„ ์œ„ํ•ด ๋ถˆ๋ฆฌ์–ธ ์ธ๋ฑ์‹ฑ ๋˜์–ด์žˆ๋˜ IsHoliday ๋ณ€์ˆ˜์™€, ๋ชจ๋ธ๋ง ์‹œ ์˜ค์ฐจ ๊ณ„์‚ฐ์„ ์ค„์ด๊ธฐ ์œ„ํ•ด Year ๋ณ€์ˆ˜๋ฅผ ๋ผ๋ฒจ ์ธ์ฝ”๋”ฉํ•˜์˜€๋‹ค. ์ถ”ํ›„ ๋ถ„์„์‹œ ์›๋ณธ ์ปฌ๋Ÿผ์„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์–ด Year_le, IsHoliday_le์™€ ๊ฐ™์ด ๋ผ๋ฒจ๋ง๋œ ๋ฐ์ดํ„ฐ ์ปฌ๋Ÿผ์„ ์ถ”๊ฐ€ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์ธ์ฝ”๋”ฉ์„ ์ง„ํ–‰ํ–ˆ๋‹ค.

# Encode IsHoliday and Year
from sklearn.preprocessing import LabelEncoder

le1 = LabelEncoder()
le2 = LabelEncoder()

df['IsHoliday_le'] = le1.fit_transform(df['IsHoliday'])
df['Year_le'] = le2.fit_transform(df['Year'])

์Šค์ผ€์ผ๋ง

์Šค์ผ€์ผ๋ง์„ ํ•˜๊ธฐ์— ์•ž์„œ, ๋ฐ์ดํ„ฐ์˜ ํ˜•ํƒœ์— ์ ํ•ฉํ•œ ์Šค์ผ€์ผ๋Ÿฌ๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด Store ํ…Œ์ด๋ธ”์— ์žˆ๋˜ ํ”ผ์ฒ˜ Size์™€, Features ํ…Œ์ด๋ธ”์— ์žˆ๋˜ feature๋“ค์— ๋Œ€ํ•ด ์ถ”๊ฐ€์ ์œผ๋กœ ์ด์ƒ์น˜ ๋ฐ ๋ถ„ํฌ ํƒ์ƒ‰์„ ์ง„ํ–‰ํ–ˆ๋‹ค.

              Lower Bound  Upper Bound
Temperature          5.28       115.68
Fuel Price           1.73         4.95
CPI                 11.43       333.01
Unemployment         4.37        11.09

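As the bounds show, Temperature and Unemployment contain outliers. To keep as much data as possible for modeling, we kept the outliers rather than removing them and scaled the features with StandardScaler. Note that StandardScaler is not itself robust to outliers, because it centers on the mean and divides by the standard deviation, both of which extreme values shift; scikit-learn's RobustScaler, which uses the median and interquartile range, is the outlier-robust option.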
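The post does not state how the lower and upper bounds were derived. Assuming the common 1.5×IQR rule, they could be computed as in this sketch; the `iqr_bounds` helper and the `toy_unemployment` values are hypothetical.

```python
import pandas as pd

def iqr_bounds(s, k=1.5):
    """Return (lower, upper) whisker bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented unemployment rates with one extreme value.
toy_unemployment = pd.Series([7.0, 7.5, 8.0, 8.5, 9.0, 14.0])

lower, upper = iqr_bounds(toy_unemployment)

# Values outside the bounds are the candidates flagged as outliers.
outliers = toy_unemployment[(toy_unemployment < lower) | (toy_unemployment > upper)]
print(lower, upper, outliers.tolist())
```

A column has outliers under this rule whenever its observed minimum or maximum falls outside the computed bounds.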

We also wanted to check whether joining the tables changes the distributions, for example because of Dept. To do so, we compared each variable's distribution in the joined table with its distribution in the original table before the join.

sns.kdeplot(data=df, x='Fuel_Price_sd', color = '#57AEFF', fill = True)
sns.kdeplot(data=features, x='Fuel_Price_sd', color = '#FFC220', fill = True)
plt.axvline(x=-2, linestyle = '--', color = '#EB5757')
plt.axvline(x=-0.5, linestyle = '--', color = '#EB5757')
plt.title('Distribution of Fuel_Price(Joined vs. Original)')
plt.legend(['Joined','Original'])

ํ™•์ธ ๊ฒฐ๊ณผ ์›๋ž˜์˜ ํ…Œ์ด๋ธ”๊ณผ ์กฐ์ธ๋œ ํ…Œ์ด๋ธ”์˜ ๋ถ„ํฌ๊ฐ€ ๋‹ค๋ฅด๊ฒŒ ๋‚˜ํƒ€๋‚ฌ๋‹ค. ๋”ฐ๋ผ์„œ ์›๋ž˜์˜ ๋ถ„ํฌ๋ฅผ ์‚ด๋ฆฌ๊ธฐ ์œ„ํ•ด์„œ Fit์€ ์›๋ž˜ ํ…Œ์ด๋ธ”(๊ฐ๊ฐ Stores, Features)์—์„œ ์ง„ํ–‰ํ•˜๊ณ  Transform์€ ์กฐ์ธ๋œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—์„œ ์ ์šฉํ•˜์˜€๋‹ค. 

df

๋”ฐ๋ผ์„œ ๋ชจ๋ธ๋ง์„ ์œ„ํ•ด Size, Temperature, Fuel_Price, CPI, Unemployment ํ”ผ์ฒ˜๋“ค์ด ์œ„์™€ ๊ฐ™์ด ์Šค์ผ€์ผ๋ง๋˜์—ˆ๋‹ค.

# ์Šค์ผ€์ผ๋ง
# ๋‚˜์˜
# Store-Size Scaling
from sklearn.preprocessing import StandardScaler
store_sd = StandardScaler()
store_sd.fit(store[['Size']])

# Store-Size Scaler df์— ์ ์šฉ
df['Size_sd'] = store_sd.transform(df[['Size']])

# ์•ˆํ•ด
# ์Šค์ผ€์ผ๋งํ•  ์—ด ๋ชฉ๋ก
columns_to_scale = ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment']
# ์Šค์ผ€์ผ๋ง ์ง„ํ–‰
scaler = StandardScaler()
scaler.fit(features[columns_to_scale])
#๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—ด์— ๋ถ™์ด๊ธฐ(์ปฌ๋Ÿผ๋ช… ๋งŒ๋“ค๊ธฐ)
col_names = ['Temperature_sd', 'Fuel_Price_sd', 'CPI_sd', 'Unemployment_sd']
#4๊ฐœ ์ปฌ๋Ÿผ ์ƒ์„ฑ ํŠธ๋žœ์Šคํผ.
df[col_names] = scaler.transform(df[columns_to_scale])