
Basic Project: Service Analysis Using Bank Customer Data (1)

ny:D 2024. 5. 17. 22:54


Project Overview

  1. Analysis goal: use bank customer data to analyze the current state of the service and segment the customers.
  2. Data source: Kaggle

About the Data

 

 

Bank User Dataset

This dataset contains user behaviors contributing to their credit score

www.kaggle.com

✅ Total rows: 50,000 (each customer has rows for September, October, November, and December, so the actual total_user count is 12,500)
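The relationship between the row count and the customer count can be checked directly. The following is only a sketch: `toy` is a made-up stand-in for the real DataFrame (2 customers instead of 12,500), and only the column names `Customer_ID` and `Month` come from the dataset.

```python
import pandas as pd

# Toy stand-in for the real data: 2 customers x 4 months (values are made up).
toy = pd.DataFrame({
    "Customer_ID": ["CUS_1"] * 4 + ["CUS_2"] * 4,
    "Month": ["September", "October", "November", "December"] * 2,
})

total_rows = len(toy)                        # one row per customer per month
total_user = toy["Customer_ID"].nunique()    # number of distinct customers
months_per_user = toy.groupby("Customer_ID")["Month"].nunique()

print(total_rows, total_user)
print(months_per_user.eq(4).all())           # does every customer have 4 months?
```

On the real data the same three lines should give 50,000 rows and 12,500 customers if every customer really has four monthly rows.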

 

 

💡 Difficulties

- Many values that should have been int or float had underscores attached (anomalous values), which made them hard to filter out cleanly.

- There was no data description, so the data itself was hard to understand.

- The data is based on the US system for opening bank accounts, which differs from the Korean one, so it was hard to understand.
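One way to see how widespread the underscore problem is would be to count the affected values per column before cleaning anything. This is a sketch under assumptions: `raw` is a made-up stand-in for the unprocessed data, and `count_underscore_affixed` is a hypothetical helper name, not something from the original notebook.

```python
import pandas as pd

# Toy stand-in for two raw columns (values are made up for illustration).
raw = pd.DataFrame({
    "Age": ["23", "28_", "-500", None],
    "Annual_Income": ["19114.12", "34847.84_", "143162.64", "30689.89"],
}, dtype=object)

def count_underscore_affixed(df):
    """Count, per column, the string values that start or end with '_'."""
    counts = {}
    for col in df.columns:
        values = df[col].dropna().astype(str)
        affected = values.str.startswith("_") | values.str.endswith("_")
        counts[col] = int(affected.sum())
    return counts

underscore_counts = count_underscore_affixed(raw)
print(underscore_counts)
```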

 

Preprocessing

1. Use only the columns that are needed

# Select only the needed columns (22 of the 27 are used)
bank = bank.drop(['Name', 'SSN', 'Monthly_Inhand_Salary', 'Changed_Credit_Limit',
                  'Num_Credit_Inquiries'], axis=1)

# Columns that will be used
bank.columns
## Index(['ID', 'Customer_ID', 'Month', 'Age', 'Occupation', 'Annual_Income',
##        'Num_Bank_Accounts', 'Num_Credit_Card', 'Interest_Rate', 'Num_of_Loan',
##        'Type_of_Loan', 'Delay_from_due_date', 'Num_of_Delayed_Payment',
##        'Credit_Mix', 'Outstanding_Debt', 'Credit_Utilization_Ratio',
##        'Credit_History_Age', 'Payment_of_Min_Amount', 'Total_EMI_per_month',
##        'Amount_invested_monthly', 'Payment_Behaviour', 'Monthly_Balance'],dtype='object')

→ Only 22 of the 27 columns are kept.
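The 27-to-22 count can be confirmed on an empty frame that only carries the column names. This is a sketch: the names are the ones listed in this post, but the original column order is not known, so the order used here is an assumption.

```python
import pandas as pd

# The 22 kept columns and the 5 dropped ones, as listed in the post.
kept = ['ID', 'Customer_ID', 'Month', 'Age', 'Occupation', 'Annual_Income',
        'Num_Bank_Accounts', 'Num_Credit_Card', 'Interest_Rate', 'Num_of_Loan',
        'Type_of_Loan', 'Delay_from_due_date', 'Num_of_Delayed_Payment',
        'Credit_Mix', 'Outstanding_Debt', 'Credit_Utilization_Ratio',
        'Credit_History_Age', 'Payment_of_Min_Amount', 'Total_EMI_per_month',
        'Amount_invested_monthly', 'Payment_Behaviour', 'Monthly_Balance']
dropped = ['Name', 'SSN', 'Monthly_Inhand_Salary', 'Changed_Credit_Limit',
           'Num_Credit_Inquiries']

# Empty stand-in for the real DataFrame (column order is an assumption).
bank = pd.DataFrame(columns=kept + dropped)
n_before = bank.shape[1]

# drop() raises a KeyError by default if any name is misspelled.
bank = bank.drop(columns=dropped)
n_after = bank.shape[1]

print(n_before, n_after)
```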

 

2. Writing the del_underbar_int / del_underbar_float functions

Values that should have been int or float but contained underscores had to be converted back to their proper data type. Repeating that work with a for loop for every column felt tedious, so I wrote functions for it.

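import pandas as pd

def del_underbar_int(data, col):
    # Strip leading/trailing underscores from string values only;
    # map() avoids relying on a default 0..n-1 index, unlike enumerate + .at.
    data[col] = data[col].map(lambda x: x.strip('_') if isinstance(x, str) else x)
    # Missing values become 0, then the column is cast to int64 (in place).
    data[col] = data[col].fillna(0).astype('int64')

def del_underbar_float(data, col):
    # Same cleanup as above, but the result is cast to float.
    data[col] = data[col].map(lambda x: x.strip('_') if isinstance(x, str) else x)
    data[col] = data[col].fillna(0).astype('float')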
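As a possible alternative, the two functions could be folded into one helper built on `pd.to_numeric`. This is a sketch, not the method used in this project: `strip_underscores_to_numeric` and its `as_int` parameter are hypothetical names, the toy values are made up, and `errors="coerce"` is a deliberate difference because it turns unparseable values into 0 instead of raising.

```python
import pandas as pd

def strip_underscores_to_numeric(series, as_int=False):
    """Hypothetical helper: strip '_' from both ends of strings, then convert."""
    cleaned = series.map(lambda x: x.strip("_") if isinstance(x, str) else x)
    # Unparseable values become NaN (errors="coerce"), and NaN is filled with 0.
    numeric = pd.to_numeric(cleaned, errors="coerce").fillna(0)
    return numeric.astype("int64") if as_int else numeric.astype("float64")

# Toy stand-in for two raw columns (values are made up for illustration).
toy = pd.DataFrame({
    "Age": ["23", "28_", None],
    "Monthly_Balance": ["312.49", "__-333.5__", None],
}, dtype=object)

toy["Age"] = strip_underscores_to_numeric(toy["Age"], as_int=True)
toy["Monthly_Balance"] = strip_underscores_to_numeric(toy["Monthly_Balance"])
print(toy.dtypes)
```

Unlike the functions above, this version returns a new Series instead of modifying the DataFrame in place, so the assignment is visible at the call site.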

 

Next Steps

✅ Decide how to fill in the Null values!

✅ Try some visualizations (if the Null values can't be filled, then at least with those rows excluded!)
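One option to consider for the Null values: since each customer appears in four monthly rows, a missing value might be recoverable from the same customer's other months. The sketch below assumes the attribute does not change between months and that rows are already ordered by month within each customer. Neither assumption has been checked against the real data, and the toy values are made up.

```python
import pandas as pd

# Toy stand-in: 2 customers x 4 months with some missing occupations.
toy = pd.DataFrame({
    "Customer_ID": ["CUS_1"] * 4 + ["CUS_2"] * 4,
    "Occupation": ["Teacher", None, "Teacher", "Teacher",
                   None, "Lawyer", "Lawyer", None],
}, dtype=object)

# Within each customer, fill forward first, then backward for leading gaps.
toy["Occupation"] = (
    toy.groupby("Customer_ID")["Occupation"]
       .transform(lambda s: s.ffill().bfill())
)

remaining_nulls = int(toy["Occupation"].isna().sum())
print(remaining_nulls)
```

A customer whose value is missing in all four months would still be Null after this step, so a fallback (or excluding those rows) would still be needed.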