๐Ÿ“Š Data Analysis/๐ŸŽฏ Project

๊ธฐ์ดˆ ํ”„๋กœ์ ํŠธ : ์€ํ–‰ ๊ณ ๊ฐ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•œ ์„œ๋น„์Šค ๋ถ„์„ (2)

ny:D 2024. 5. 20. 23:50

๊ธฐ์ดˆ ํ”„๋กœ์ ํŠธ : ์€ํ–‰ ๊ณ ๊ฐ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•œ ์„œ๋น„์Šค ๋ถ„์„

 

 

๊ธฐ์ดˆ ํ”„๋กœ์ ํŠธ : ์€ํ–‰ ๊ณ ๊ฐ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•œ ์„œ๋น„์Šค ๋ถ„์„ (1)

๊ธฐ์ดˆ ํ”„๋กœ์ ํŠธ : ์€ํ–‰ ๊ณ ๊ฐ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•œ ์„œ๋น„์Šค ๋ถ„์„ํ”„๋กœ์ ํŠธ ๊ฐœ์š”๋ถ„์„ ๋ชฉ์  : ์€ํ–‰ ๊ณ ๊ฐ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•ด ์„œ๋น„์Šค์˜ ํ˜„ํ™ฉ์„ ๋ถ„์„ํ•˜๊ณ  ๊ณ ๊ฐ์„ ๋ถ„๋ฅ˜ํ•˜๊ธฐ.๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ : Kaggle๋ฐ์ดํ„ฐ ์†Œ๊ฐœ  Ba

archivenyc.tistory.com

๋‹ค์‹œ ์‹œ์ž‘ํ•œ ์ „์ฒ˜๋ฆฌ..! ๊ทธ๋ž˜๋„ ๋ชฉํ‘œ๊ฐ€ ๊ตฌ์ฒดํ™” ๋œ ๋•์— ์˜คํžˆ๋ ค ๊ฐ–๊ณ  ๊ฐ€์•ผ ํ•  ๋ถ€๋ถ„๊ณผ ๋ฒ„๋ฆฌ๊ณ  ๊ฐ€์•ผ ํ•  ๋ถ€๋ถ„์ด ๋ช…ํ™•ํ•ด์ ธ์„œ ์ข‹์•˜๋‹ค. ๋ชธ์ด ์•ˆ๋”ฐ๋ผ์ค˜์„œ ์‹œ๊ฐ„์„ ๋งŽ์ด ์Ÿ์ง€๋Š” ๋ชปํ–ˆ์ง€๋งŒ, ๊ทธ๋ž˜๋„ ๋ฐ€๋„ ์žˆ๊ฒŒ ๋ณด๋‚ธ๊ฒƒ ๊ฐ™์•„ ๋ฟŒ๋“ฏํ–ˆ๋˜ ํ•˜๋ฃจ!

๐ŸŽฏ ๋ชฉํ‘œ ์„ค์ •

๋‚˜์ด, ์ง์—…, ์—ฐ๊ฐ„ ์†Œ๋“ ๋“ฑ์˜ ํŠน์„ฑ์— ๋”ฐ๋ผ ๊ณ ๊ฐ์„ ์–ด๋–ป๊ฒŒ ์„ธ๋ถ„ํ™”ํ•  ์ˆ˜ ์žˆ์„๊นŒ์š”?

→ ๊ณ ๊ฐ ID(Customer_ID)๋ฅผ ๊ธฐ๋ณธ๊ฐ’์œผ๋กœ ํ•œ ํ…Œ์ด๋ธ” ์ƒ์„ฑํ•˜๊ธฐ

  • ์ˆซ์ž ๋ฐ์ดํ„ฐํƒ€์ž… : mean์„ ์ด์šฉํ•ด ๊ณ ๊ฐ๋ณ„ ํ•ด๋‹น ์ปฌ๋Ÿผ๊ฐ’์˜ ํ‰๊ท  ๊ตฌํ•˜๊ธฐ
  • ๋ฌธ์ž ๋ฐ์ดํ„ฐํƒ€์ž… : ํ™œ์šฉํ•˜๊ธฐ ์ข‹๊ฒŒ ์ „์ฒ˜๋ฆฌํ•˜๊ธฐ

 

๐Ÿ’พ  ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

1. ์ˆซ์ž ๋ฐ์ดํ„ฐํƒ€์ž…

# ์–ธ๋”๋ฐ” ์ œ๊ฑฐ ํ›„ ํƒ€์ž… ๋ณ€๊ฒฝ ํ•จ์ˆ˜
def del_underbar_int(data, col):
    for i, x in enumerate(data[col]):
        if pd.notnull(x) and '_' in x:
            data.at[i, col] = x.strip('_')
    
    data[col] = data[col].fillna(0).astype('int64')

def del_underbar_float(data, col):
    for i, x in enumerate(data[col]):
        if pd.notnull(x) and '_' in x:
            data.at[i, col] = x.strip('_')
    
    data[col] = data[col].fillna(0).astype('float')
    
# object -> int/ float๋กœ ์ปฌ๋Ÿผ ํƒ€์ž… ๋ณ€๊ฒฝ
del_underbar_int(bank,'Age')
del_underbar_int(bank,'Num_of_Delayed_Payment')

del_underbar_float(bank,'Annual_Income')
del_underbar_float(bank,'Outstanding_Debt')
del_underbar_float(bank,'Monthly_Balance')
del_underbar_float(bank,'Amount_invested_monthly')

# ๋ฐ์ดํ„ฐ ํ•„ํ„ฐ๋ง (mask ์ด์šฉ, 0์„ธ ์ด์ƒ 109์„ธ ๋ฏธ๋งŒ) ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ๋งŒ๋“ค๊ธฐ
mask = ((bank['Age']>0) & (bank['Age']<109)) # bank['Age'].describe() ๊ฒฐ๊ณผ mean 109.7144, ์ƒ์‹์ ์œผ๋กœ ์—ฐ๋ น๋Œ€๊ฐ€ ๊ทธ ์ด์ƒ์ธ ๊ฒƒ์€ ์ด์ƒ์น˜๋ผ๊ณ  ํŒ๋‹จ. ์ œ์™ธ
df1 = bank[mask] 
sns.boxplot(data=df1, x='Age') #boxplot์œผ๋กœ ๋ถ„ํฌ ํ™•์ธ

# ์ˆซ์žํ˜• ์ปฌ๋Ÿผ๋งŒ ์ถ”์ถœ
df1 = df1[['Customer_ID', 'Age', 'Annual_Income',
     'Delay_from_due_date', 'Num_of_Delayed_Payment', 'Outstanding_Debt', 
     'Credit_Utilization_Ratio', 'Total_EMI_per_month', 
     'Monthly_Balance','Amount_invested_monthly']]

# age, ๊ทธ ์™ธ ์ปฌ๋Ÿผ๋“ค์„ ๋”ฐ๋กœ ๋งŒ๋“ค์–ด ๊ฒฐํ•ฉํ•˜๊ธฐ
df11 = df1[df1['Age']<=100].groupby('Customer_ID').mean().iloc[:,0]
df12 = df1.groupby('Customer_ID').mean().iloc[:,1:9]

bank_float= pd.merge(df11,df12, on = 'Customer_ID', how = 'inner')

# ํ™•์ธํ•˜๊ธฐ
bank_float

2. ๋ฌธ์ž ๋ฐ์ดํ„ฐํƒ€์ž…

# ๐ŸŒˆType_of_Loan ๊ฐฏ์ˆ˜ ์ œ๋Œ€๋กœ ์„ธ๊ธฐ

bank['Type_of_Loan'] = bank['Type_of_Loan'].str.replace(', and ', ', ')

def get_length(data):
    if pd.isnull(data):
        return 0  # ๊ฒฐ์ธก์น˜์ธ ๊ฒฝ์šฐ 0์„ ๋ฐ˜ํ™˜
    lst = data.split(',')
    return len(lst)
    
bank['Num_of_Loan'] = bank['Type_of_Loan'].apply(get_length)
 

 # ๐ŸŒˆ< Credit_History_Age ์ปฌ๋Ÿผ์—์„œ ์•ž 2์ž๋ฆฌ ๊ฐ€์ ธ์˜ค๊ธฐ >

def extract_years(data):
  if pd.isnull(data):
    return 0
  result = data.split()[0]
  return result

bank['Credit_History_Age'] = bank['Credit_History_Age'].apply(extract_years)
bank['Credit_History_Age']


# ๐ŸŒˆ๋ฌธ์ž์—ดํ–‰ < ๊ณ ๊ฐ ํ…Œ์ด๋ธ” (๋ฌธ์ž์—ด) - Customer_ID, Occupation, Type of loan, Credit Mix >

# 1. ํ•ด๋‹น ์ปฌ๋Ÿผ๋งŒ bank ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—์„œ ์ƒˆ๋กœ bank_str๋กœ ๋งŒ๋“ค๊ธฐ
bank_str = bank.loc[:, ['Customer_ID', 'Occupation', 'Type_of_Loan', 'Credit_Mix']]

# 2. '_'๊ฐ€ ํฌํ•จ๋œ ์ด์ƒ๊ฐ’ 4๊ฐœ์”ฉ ๋ฐ˜๋ณต๋˜๋Š” ๋ฌธ์ž์—ด๋กœ ์ฑ„์šฐ๊ธฐ - ๋ฐ์ดํ„ฐ 4๊ฐœ ๋™์ผํ•˜๊ฒŒ ๋งŒ๋“ค๊ธฐ
# 3. ๋™์ผํ•œ 4๊ฐœ์˜ ๋ฐ์ดํ„ฐ 1๊ฐœ๋กœ ์ค„์ด๊ธฐ

import pandas as pd
import numpy as np

def fill_missing_values(df):
    # 4๊ฐœ์”ฉ ๋ฐ˜๋ณต๋˜๋Š” ๊ทธ๋ฃน์œผ๋กœ ๋‚˜๋ˆ„์–ด ์ฒ˜๋ฆฌ
    for i in range(0, len(df), 4):
        group = df.iloc[i:i+4]
        # Occupation ์ปฌ๋Ÿผ ์ฒ˜๋ฆฌ
        occupation_mode = group.loc[~group['Occupation'].str.contains('_'), 'Occupation'].mode()
        if not occupation_mode.empty:
            df.loc[i:i+3, 'Occupation'] = occupation_mode[0]
        # Credit_Mix ์ปฌ๋Ÿผ ์ฒ˜๋ฆฌ
        credit_mix_mode = group.loc[~group['Credit_Mix'].str.contains('_'), 'Credit_Mix'].mode()
        if not credit_mix_mode.empty:
            df.loc[i:i+3, 'Credit_Mix'] = credit_mix_mode[0]
    return df

# ๊ฒฐ์ธก๊ฐ’ ์ฑ„์šฐ๊ธฐ
bank_str_filled = fill_missing_values(bank_str)

# ์ค‘๋ณต๋œ ํ–‰ ์ œ๊ฑฐ (์ฒซ ๋ฒˆ์งธ ํ–‰๋งŒ ๋‚จ๊น€)
bank_str_unique = bank_str_filled.drop_duplicates('Customer_ID')

# ๋ณ€ํ™˜๋œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ํ™•์ธ
bank_str_unique


# ๐ŸŒˆ bank_str_unique ์—์„œ Type_of_Loan ๊ฐฏ์ˆ˜ ์ œ๋Œ€๋กœ ์„ธ๊ณ  Num_of_Loan ์ปฌ๋Ÿผ ์ถ”๊ฐ€ํ•˜๊ธฐ
def get_length(data):
  if pd.isnull(data):
    return 0  # ๊ฒฐ์ธก์น˜์ธ ๊ฒฝ์šฐ 0์„ ๋ฐ˜ํ™˜
  lst = data.split(',')
  return len(lst)
    
bank_str_unique['Num_of_Loan'] = bank_str_unique['Type_of_Loan'].apply(get_length)

# ๐ŸŒˆ Type_of_Loan ๋ฌธ์ž์—ด ํ˜•์‹ ๊ฒฐ์ธก๊ฐ’ NaN -> ๋นˆ ๋ฌธ์ž์—ด๋กœ ์ฑ„์šฐ๊ธฐ
bank_str_unique['Type_of_Loan'] = bank_str_unique['Type_of_Loan'].fillna('')

3.  ๋ฐ์ดํ„ฐ ๊ฒฐํ•ฉํ•˜๊ธฐ ๋ฐ ๋ฐ์ดํ„ฐ ๊ฐ€๊ณต

# bank_float์™€ bank_str_unique ๊ฒฐํ•ฉํ•˜๊ธฐ
df = pd.merge(bank_float, bank_str_unique, on = 'Customer_ID', how = 'inner')

# Age์— ๋”ฐ๋ฅธ ๋‚˜์ด๋Œ€ ๋ถ„๋ฅ˜
df['age_group']= [int(i//10)*10 for i in df['Age']]