๐Ÿ“Š Data Analysis/๐ŸŽฏ Project

๊ธฐ์ดˆ ํ”„๋กœ์ ํŠธ : ์€ํ–‰ ๊ณ ๊ฐ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•œ ์„œ๋น„์Šค ๋ถ„์„ (3)

ny:D 2024. 5. 27. 00:16

๊ธฐ์ดˆ ํ”„๋กœ์ ํŠธ : ์€ํ–‰ ๊ณ ๊ฐ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•œ ์„œ๋น„์Šค ๋ถ„์„

 

 

๊ธฐ์ดˆ ํ”„๋กœ์ ํŠธ : ์€ํ–‰ ๊ณ ๊ฐ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•œ ์„œ๋น„์Šค ๋ถ„์„ (2)

๊ธฐ์ดˆ ํ”„๋กœ์ ํŠธ : ์€ํ–‰ ๊ณ ๊ฐ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•œ ์„œ๋น„์Šค ๋ถ„์„  ๊ธฐ์ดˆ ํ”„๋กœ์ ํŠธ : ์€ํ–‰ ๊ณ ๊ฐ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•œ ์„œ๋น„์Šค ๋ถ„์„ (1)๊ธฐ์ดˆ ํ”„๋กœ์ ํŠธ : ์€ํ–‰ ๊ณ ๊ฐ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•œ ์„œ๋น„์Šค ๋ถ„์„ํ”„๋กœ์ ํŠธ

archivenyc.tistory.com

 

์ž๊พธ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ๋‹จ๊ณ„๋กœ ๋Œ์•„๊ฐ€๋Š” ์ค‘....ใ… ์ด์ง€๋งŒ ์ค‘๊บพ๋งˆ ๋งˆ์ธ๋“œ๋กœ ์ผ๋‹จ ํ•ด๋ณด๋Š” ์ค‘์ด๋‹ค.!

 

๐Ÿงน ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌํ•˜๊ธฐ

๊ฒฐ์ธก๊ฐ’ ์ฐพ๊ธฐ

bank.isna().sum()์„ ํ™œ์šฉํ•ด ์ „์ฒด ์ปฌ๋Ÿผ ์ค‘ ๊ฒฐ์ธก์น˜๊ฐ€ ์žˆ๋Š” ์ปฌ๋Ÿผ์„ ์ฐพ์•˜๋‹ค. ๊ฒฐ์ธก๊ฐ’์ด ๋ฐœ๊ฒฌ๋œ ์ปฌ๋Ÿผ์€ Type of Loan, Num_of_Delayed_Payment, Credit_History_Age, Amount_invested_monthly, Monthly_Balance ์ด๋ ‡๊ฒŒ 5๊ฐœ์˜ ์ปฌ๋Ÿผ์ด์—ˆ๋‹ค. ๊ฐ๊ฐ์˜ ์ปฌ๋Ÿผ์— ๋Œ€ํ•ด ์•Œ๋งž๋Š” ๊ฒฐ์ธก๊ฐ’์„ ๊ณ„์‚ฐํ•ด ์ฑ„์›Œ๋ณด๋„๋ก ํ•˜๊ฒ ๋‹ค.

bank.isna().sum()

## ID                             0
...
## Type_of_Loan                5704
## Delay_from_due_date            0
## Num_of_Delayed_Payment      3498
## Credit_Mix                     0
## Outstanding_Debt               0
## Credit_Utilization_Ratio       0
## Credit_History_Age          4470
## Payment_of_Min_Amount          0
## Total_EMI_per_month            0
## Amount_invested_monthly     2271
## Payment_Behaviour              0
## Monthly_Balance              562
## dtype: int64

Type of Loan : 'none'

๋Œ€์ถœ์„ ๋ฐ›์€ ๊ณ ๊ฐ์˜ ๊ฒฝ์šฐ Type of Loan ์ปฌ๋Ÿผ์— '์–ด๋–ค ๋Œ€์ถœ์„ ๋ฐ›์•˜๋Š”์ง€' ๋Œ€์ถœ์˜ ์ข…๋ฅ˜๊ฐ€ ๋ช…๊ธฐ๋˜์–ด์žˆ๋‹ค. ํ™•์ธ ๊ฒฐ๊ณผ, ๊ฒฐ์ธก๊ฐ’์ด ์กด์žฌํ•˜๋Š” ํ–‰๋“ค์€ ๋Œ€์ถœ์˜ ๊ฐœ์ˆ˜๋ฅผ ์„ธ๋Š” 'num_of_loan'์ปฌ๋Ÿผ์— ๊ฐ’์ด ์—†๋Š”๊ฒƒ์„ ํ™•์ธํ–ˆ๊ณ , ๋”ฐ๋ผ์„œ ๋Œ€์ถœ์„ ๋ฐ›์ง€ ์•Š์€๊ณ ๊ฐ์œผ๋กœ ๊ฐ„์ฃผ, 'none'์œผ๋กœ null๊ฐ’์„ ์ฑ„์› ๋‹ค.

 

bank['Type_of_Loan'] = bank['Type_of_Loan'].fillna('none')

Num_of_Delayed_Payment : ํ‰๊ท 

delayed payment๊ฐ€ ๋ฐœ์ƒํ–ˆ๋Š”์ง€, ๋ฐœ์ƒํ•˜์ง€ ์•Š์•˜๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜ ์—†๊ธฐ ๋•Œ๋ฌธ์—  ์ด์ƒ์น˜๋ฅผ ์ œ์™ธ(0~30๋ฒ”์œ„ ์™ธ ๊ฐ’)ํ•ด ๊ณ„์‚ฐํ•œ ํ‰๊ท (mean_delayed_payment)๋กœ ๊ฒฐ์ธก์น˜๋ฅผ ์ฑ„์›Œ๋„ฃ์—ˆ๋‹ค.

# Num_of_Delayed_Payment
cond = (bank['Num_of_Delayed_Payment']>=0) & (bank['Num_of_Delayed_Payment']<=30)
mean_delayed_payment = bank.loc[cond,'Num_of_Delayed_Payment'].mean()
mean_delayed_payment #13.417525098210389

# ๊ฒฐ์ธก์น˜ ์ฑ„์šฐ๊ธฐ
bank['Num_of_Delayed_Payment'] = bank['Num_of_Delayed_Payment'].fillna(mean_delayed_payment)

Credit_History_Age : 0

Credit History Age์— ๋Œ€ํ•œ ์ •๋ณด๊ฐ€ ์—†๋Š” ๊ฒฝ์šฐ, ์€ํ–‰์— ๋Œ€ํ•œ ์‹ ์šฉ ๊ธฐ๋ก์ด ์—†๋‹ค๊ณ  ๊ฐ„์ฃผ 0์œผ๋กœ ๊ฐ’์„ ์ฑ„์›Œ๋„ฃ์—ˆ๋‹ค. (์ „์ฒ˜๋ฆฌ ๋‹จ๊ณ„์—์„œ year๋‹จ์œ„๋งŒ ์ถ”์ถœํ•˜๊ธฐ ์œ„ํ•ด extract_years๋ฅผ ์‚ฌ์šฉํ–ˆ๋Š”๋ฐ, ์ด ๊ณผ์ •์—์„œ pd.isnull์„ ํ™œ์šฉํ•ด ๊ฒฐ์ธก์น˜๋ฅผ ์ฑ„์› ๋‹ค.)

def extract_years(data):
  if pd.isnull(data):
    return 0
  result = data.split()[0]
  return result

bank['Credit_History_Age'] = bank['Credit_History_Age'].apply(extract_years)

 

Amount_invested_monthly : customer_id๋ณ„ ํ‰๊ท 

Amount_invested_monthly์˜ ๊ฒฝ์šฐ ๋ฐ˜๋ณต์ ์œผ๋กœ __10000__๊ฐ’์ด ๋“ฑ์žฅํ•ด ์–ธ๋”๋ฐ” ์ œ๊ฑฐ์— ํฌํ•จํ•˜์ง€ ์•Š์•˜๋‹ค. (1. __10000__์ด๋ผ๋Š” ๊ฐ™์€ ๊ฐ’๋“ค์ด 10ํšŒ ์ด์ƒ ๋ฐœ๊ฒฌ 2. customer_id๋ณ„๋กœ ์ƒ˜ํ”Œ์„ ๋ฝ‘์•„ ํ™•์ธํ•ด๋ณด์•˜์„ ๋–„ ๋„ค๋‹ฌ๊ฐ„์˜ ๋ฐ์ดํ„ฐ ์ค‘ ์–ธ๋”๋ฐ”๋ฅผ ์ œ๊ฑฐํ•œ ์ƒํƒœ์ธ 10000์ด ์ง€๋‚˜์น˜๊ฒŒ ํฐ ๊ฐ’์ด๋ผ๋Š” ์‚ฌ์‹ค์„ ํ™•์ธํ•จ)

๋”ฐ๋ผ์„œ __10000__๊ณผ ๊ณต๋ฐฑ๊ฐ’์„ ๋ชจ๋‘ error, ๊ฒฐ์ธก์น˜๋กœ ํŒ๋‹จํ•˜๊ณ  customer_id๋ณ„ ํ‰๊ท ๊ฐ’์„ ๊ณ„์‚ฐํ•ด ์ด๋ฅผ ์ฑ„์›Œ๋„ฃ์—ˆ๋‹ค. 1์ฐจ์ ์œผ๋กœ ๊ฐ’์„ ์ฑ„์›Œ๋„ฃ์€ ๊ฒฐ๊ณผ 4๊ฐœ์˜ ๊ณต๋ฐฑ๊ฐ’์ด ๋ฐœ์ƒํ–ˆ๊ณ , ์ด๋Š” ์•ž์„œ ์ด์ƒ์น˜๋ฅผ ์‚ญ์ œํ•˜์˜€๊ธฐ ๋•Œ๋ฌธ์ด์—ˆ๋‹ค. (์ด๋Ÿฌํ•œ ์‚ฌ์œ ๋•Œ๋ฌธ์— ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ๋ฅผ ๋จผ์ € ํ•ด์•ผํ•œ๋‹ค๋Š” ์‚ฌ์‹ค์„ ๊นจ๋‹ซ๊ฒŒ ๋˜์—ˆ๋‹ค..!) 4๊ฐœ์˜ ๊ณต๋ฐฑ๊ฐ’์˜ ๊ฒฝ์šฐ ์ฑ„์›Œ๋„ฃ์„ ๊ฐ’์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์•Œ ์ˆ˜ ์—†์—ˆ๊ธฐ ๋•Œ๋ฌธ์— 0์œผ๋กœ ๊ฐ’์„ ์ฑ„์›Œ๋„ฃ์—ˆ๋‹ค.

# ๋ฐ์ดํ„ฐ ํƒ€์ž…์„ numeric์œผ๋กœ ๋ณ€ํ™˜ํ•˜๊ธฐ : __10000__๊ฐ’ ๋˜๋Š” ๊ณต๋ฐฑ๊ฐ’์€ ๋ชจ๋‘ error๋กœ ๋ฐ˜ํ™˜
bank['Amount_invested_monthly'] = pd.to_numeric(bank['Amount_invested_monthly'], errors='coerce')

# customer_id๋ณ„ ํ‰๊ท ๊ฐ’์œผ๋กœ error๋‚œ ๊ณณ ์ฑ„์›Œ๋„ฃ๊ธฐ
bank['Amount_invested_monthly'] = bank.groupby('Customer_ID')['Amount_invested_monthly'].transform(lambda x: x.fillna(x.mean()))

# null๊ฐ’ ์„ธ๊ธฐ
bank['Amount_invested_monthly'].isna().sum()  ## ๊ฒฐ๊ณผ 4

# customer_id๊ฐ€ 'CUS_0x311a'์ธ ๊ณ ๊ฐ์˜ ๊ฒฝ์šฐ amount_invested_monthly ๊ฐ’์ด ๋ชจ๋‘ ์—†์Œ -> 0์œผ๋กœ ์ฑ„์šฐ๊ธฐ
bank['Amount_invested_monthly'] = bank['Amount_invested_monthly'].fillna(0)

Monthly_Balance : ํ‰๊ท (mean)

Monthly_balance์˜ ๊ฒฝ์šฐ ์—ญ์‹œ 562๊ฐœ์˜ ๊ฒฐ์ธก์น˜๊ฐ€ ๋ฐœ์ƒํ•˜์˜€๋Š”๋ฐ, ์ด์ƒ์น˜์ธ 0์„ ์ œ์™ธํ•œ ๋‚˜๋จธ์ง€ ๊ฐ’๋“ค์˜ ํ‰๊ท ์œผ๋กœ ๊ฒฐ์ธก๊ฐ’์„ ์ฑ„์›Œ๋„ฃ์—ˆ๋‹ค.

# monthly balance ๊ฒฐ์ธก์น˜ (ํ‰๊ท ์œผ๋กœ ์ฑ„์šฐ๊ธฐ)
mean_balance = bank[bank['Monthly_Balance']>0]['Monthly_Balance'].mean()
bank['Monthly_Balance'] = bank['Monthly_Balance'].fillna(mean_balance)

๐Ÿ—‘๏ธ ์ด์ƒ์น˜ ์ฒ˜๋ฆฌํ•˜๊ธฐ

์ปฌ๋Ÿผ ์ •๋ณด๊ฐ€ ๋ถˆ๋ช…ํ™•ํ–ˆ์Œ์—๋„, ์ƒ์‹์ด๋‚˜ ๊ด€๋ จ ์กฐ์‚ฌ ๊ฒฐ๊ณผ ๋ง์ด ๋˜์ง€ ์•Š๋Š” ์ด์ƒ์น˜๋“ค์ด ๋ฐœ๊ฒฌ๋˜์—ˆ๊ณ , ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ฒ˜๋ฆฌํ•˜์˜€๋‹ค.

Age : 0 ~ 109

bank = bank[((bank['Age']>0) & (bank['Age']<=109))]
bank['Customer_ID'].count() ##48599

Interest_Rate : 30์ดํ•˜

bank = bank[bank['Interest_Rate']<30]
bank['Interest_Rate'].count()  #43846

Num_of_Delayed_Payment 0์ด์ƒ 30 

cond = (bank['Num_of_Delayed_Payment'] >= 0) & (bank['Num_of_Delayed_Payment'] <= 30)
bank = bank[cond]
bank['Num_of_Delayed_Payment'].count() ##43,226

 

Monthly Balance, Delay_from_due_date ์Œ์ˆ˜์‚ญ์ œ

bank = bank[bank['Monthly_Balance']>0]
bank['Monthly_Balance'].count() ##43220

cond = (bank['Delay_from_due_date']) >= 0 & (bank['Delay_from_due_date'] <= 54)
bank = bank[cond]
bank['Delay_from_due_date'].count()##42942