๐Ÿ“Š Data Analysis/๐ŸŽฏ Project

๊ธฐ์ดˆ ํ”„๋กœ์ ํŠธ : ์€ํ–‰ ๊ณ ๊ฐ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•œ ์„œ๋น„์Šค ๋ถ„์„(4)

ny:D 2024. 5. 27. 00:23

๊ธฐ์ดˆ ํ”„๋กœ์ ํŠธ : ์€ํ–‰ ๊ณ ๊ฐ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•œ ์„œ๋น„์Šค ๋ถ„์„

๐Ÿ—‚๏ธ ๋ฐ์ดํ„ฐ ์ •๊ทœํ™” & ํ‘œ์ค€ํ™”

Age_group์„ ์ƒ์„ฑ

๊ณ ๊ฐ๋“ค์˜ Age๋ฅผ 10๋Œ€, 20๋Œ€, 30๋Œ€ ๋“ฑ์˜ ๋ฒ”์ฃผ๋กœ ๊ทธ๋ฃนํ™”ํ•ด ‘age_group’ ์ปฌ๋Ÿผ์„ ์ถ”๊ฐ€์ ์œผ๋กœ ์ƒ์„ฑํ–ˆ๋‹ค.

bank_numeric['age_group']= [int(i//10)*10 for i in bank_numeric['Age']]

 

Loan Type์„ ๋ถ„๋ฆฌํ•ด ๋”๋ฏธ ๋ณ€์ˆ˜๋กœ ๋งŒ๋“ค๊ธฐ

Type_of_Loan ์ปฌ๋Ÿผ์˜ ๊ฒฝ์šฐ Loan์˜ ์ข…๋ฅ˜๋ณ„๋กœ ๊ตฌ๋ถ„๋œ ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ์ปด๋งˆ๋ฅผ ํ†ตํ•ด ํ•˜๋‚˜์˜ ๊ฐ’์œผ๋กœ ์—ฐ๊ฒฐ๋˜์–ด์žˆ๋Š”๋ฐ, ์ด๋ ‡๊ฒŒ ๋˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•˜๊ฒŒ๋œ๋‹ค.
โœ… ๊ณ ๊ฐ๋งˆ๋‹ค ์–ด๋–ค ์ข…๋ฅ˜์˜ ๋Œ€์ถœ์„ ๋ฐ›์•˜๋Š”์ง€ ํ™•์ธํ•˜๊ธฐ ์–ด๋ ค์›€
โœ… ๋Œ€์ถœ ์ข…๋ฅ˜๋Š” ๊ฐ™์•„๋„ ์ˆœ์„œ๊ฐ€ ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ํ•˜๋‚˜์˜ ์ปฌ๋Ÿผ์œผ๋กœ ํ•ฉ์น˜๊ธฐ ์–ด๋ ค์›€
๋”ฐ๋ผ์„œ ๋ฐ์ดํ„ฐ์—์„œ ์กฐ๊ธˆ๋” ์‰ฝ๊ฒŒ ์ธ์‚ฌ์ดํŠธ๋ฅผ ๋ฝ‘์•„๋‚ด๊ธฐ ์œ„ํ•ด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํ˜•ํƒœ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€๊ณตํ•จ. (๋”๋ฏธ ๋ณ€์ˆ˜ํ™”)

loans = ['Auto Loan', 'Credit-Builder Loan','Personal Loan','Payday Loan','Mortgage Loan','Home Equity Loan','Debt Consolidation Loan','Student Loan', 'Not Specified']

for loan_type in loans:
    customer[loan_type] = [1 if loan_type in entry else 0 for entry in customer['Type_of_Loan']]

์ด ๊ณผ์ •์—์„œ ๊ฒฐ์ธก์น˜๊ฐ€ ๋งŽ์•˜๋˜ Num_of_Loan(๋Œ€์ถœ ๊ฐœ์ˆ˜)๋ฅผ ์œ„์—์„œ ๋ถ„๋ฆฌํ•œ ๋Œ€์ถœ ์ข…๋ฅ˜ ์ปฌ๋Ÿผ๋“ค์„ ๋ฐ”ํƒ•์œผ๋กœ ์ƒˆ๋กœ ๊ณ„์‚ฐํ–ˆ๋‹ค.

customer['Num_of_Loan'] = customer['Auto Loan']+customer['Credit-Builder Loan']+customer['Personal Loan']+customer['Payday Loan']+customer['Mortgage Loan']+customer['Home Equity Loan']+customer['Debt Consolidation Loan']+customer['Student Loan']+customer['Not Specified']

 

๐Ÿ“Š ์‹œ๊ฐํ™” - ์†Œ๋“์ด ๋†’์€ ๊ณ ๊ฐ(VIP)๊ณ ๊ฐ์„ ์žก์•„๋ผ!

1. ์†Œ๋“์ด ๋†’์€ ๊ณ ์†Œ๋“์ž ๊ณ ๊ฐ์„ ์ •์˜

๋ชจ๋ฆ„์ง€๊ธฐ VIP๋ผ๋ฉด, ๋‚จ๋“ค๊ณผ๋Š” ๋‹ค๋ฅด๊ฒŒ ๋ˆ๐Ÿ’ฐ์ด ์•„์ฃผ ๋งŽ์•„์•ผ ํ•œ๋‹ค.! ํ‰๊ท ์ ์ด์ง€ ์•Š์€, ํ‰๋ฒ”ํ•˜์ง€ ์•Š์€ ๊ณ ๊ฐ์„ ์ฐพ์•„์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— IQR ๊ธฐ๋ฒ•์„ ์ด์šฉํ•ด ์‹ ๋ขฐ๊ตฌ๊ฐ„์„ ๋ฒ—์–ด๋‚œ ์ด์ƒ์น˜์— ์ฃผ๋ชฉํ•ด๋ณด๊ฒ ๋‹ค.

# x ์ถ• ํ‘œ์‹œ ํ˜•์‹์„ ๋ณ€๊ฒฝํ•˜๋Š” ํ•จ์ˆ˜ ์ •์˜
from matplotlib.ticker import FuncFormatter

# x ์ถ• ํ‘œ์‹œ ํ˜•์‹์„ ๋ณ€๊ฒฝํ•˜๋Š” ํ•จ์ˆ˜ ์ •์˜
def format_func(x, pos):
    return '{:,.0f}'.format(x/1000)  # #,### ํ˜•์‹์œผ๋กœ ํฌ๋งทํŒ…
plt.figure(figsize=(16,9))
sns.boxplot(data=customer,x='Annual_Income')

# x ์ถ•์— ์ ์šฉํ•  Formatter ์„ค์ •
formatter = FuncFormatter(format_func)
plt.gca().xaxis.set_major_formatter(formatter)

# ๊ทธ๋ž˜ํ”„ ํ‘œ์‹œ
plt.xlabel('Annual_Income($1k)')

upper bound ์ด์ƒ์˜ ๊ณ ์•ก์—ฐ๋ด‰์ž๋“ค์€ ์ด 512๋ช…์œผ๋กœ, ์ „์ฒด 1,1437๋ช…์˜ ๊ณ ๊ฐ์ค‘ 4.47%๋ฅผ ์ฐจ์ง€ํ•œ๋‹ค. ๊ณ ์•ก ์—ฐ๋ด‰์ž๋“ค์„ ๋”ฐ๋กœ high_salary_group ํ…Œ์ด๋ธ”๋กœ ๋ถ„๋ฆฌํ•˜์—ฌ ์ž์„ธํžˆ ์‚ดํŽด๋ณด์ž.

q1 = customer['Annual_Income'].quantile(0.25)
q3 = customer['Annual_Income'].quantile(0.75)
upper_bound = q3+(q3-q1)*1.5

customer[customer['Annual_Income']>=upper_bound]['Customer_ID'].count()

 

โœ… 30๋Œ€ ์ดˆ๋ฐ˜(30-35์„ธ)์˜ ๊ณ ๊ฐ์ด 88๋ช…์œผ๋กœ ๊ฐ€์žฅ ๋งŽ๋‹ค. (๋น„์œจ ์ ์œผ๋กœ๋„ 164/3326, 4.93%)

# 30๋Œ€ ์ดˆ๋ฐ˜(30-35์„ธ)์˜ ๊ณ ๊ฐ์ด 88๋ช…์œผ๋กœ ๊ฐ€์žฅ ๋งŽ๋‹ค.

sns.histplot(data=high_salary_group,x='Age',
             binrange=[10,60], bins=10,
             color='green', alpha=0.5)

# ์ตœ๋Œ€ ๋นˆ๋„์ˆ˜ ๊ณ„์‚ฐํ•˜๊ธฐ
cond = (high_salary_group['Age']<35) & (high_salary_group['Age']>=30)
max_cnt = high_salary_group.loc[cond,'Age'].count()

plt.text(31.5, 100, max_cnt)
plt.ylim(0,110)
plt.title('Age Distribution of High Salary Group')

โœ… High Salary Group์˜ ๊ฒฝ์šฐ ๊ทธ ์™ธ ๊ณ ๊ฐ๋“ค์— ๋น„ํ•ด ํ‰๊ท ์ ์œผ๋กœ ์—ฐ์ฒด์ผ์ˆ˜ Delay_from_due_date , ์—ฐ์ฒด ํšŸ์ˆ˜ Num_of_Delayed_Payment , ์ฑ„๋ฌดOutstanding_Debt๊ฐ€ ์ ๋‹ค.

# palette 
green_palette = ['#22CC88','#EEEEEE']

# customer ํ…Œ์ด๋ธ”์— VIP, else๋กœ ๊ฐ’์ด ์ž…๋ ฅ๋˜๋Š” salary_group ์ปฌ๋Ÿผ ์ƒ์„ฑ
customer['salary_group'] = ['VIP' if income > upper_bound else 'else' for income in customer['Annual_Income']]

# Delay from Due Date
sns.boxplot(data=customer, x='salary_group', y='Delay_from_due_date', palette=green_palette)
plt.title('Delay from Due Date by Salary Group')

# Num_of_Delayed_Payment
sns.boxplot(data=customer, x='salary_group', y='Num_of_Loan', palette = green_palette)
plt.title('Number of Loans by Salary Group')

# Outstanding Debt
sns.boxplot(data=customer, x='salary_group', y='Outstanding_Debt', palette=green_palette)
plt.title('Outstanding Debt by Salary Group')

โœ… High Salary Group์˜ ๊ฒฝ์šฐ ๊ทธ ์™ธ ๊ณ ๊ฐ๋“ค์— ๋น„ํ•ด ํ‰๊ท ์ ์œผ๋กœ ์‹ ์šฉ ์ „ํ™˜์œจ Credit_Utilization_Ratio , ์›” ์ž”๊ณ ์•ก Monthly_Balance , ์›”๊ฐ„ ์ €์ถ•์•ก Amount_invested_monthly ์ด ๋†’๋‹ค.

 

# palette 
green_palette = ['#22CC88','#EEEEEE']

# customer ํ…Œ์ด๋ธ”์— VIP, else๋กœ ๊ฐ’์ด ์ž…๋ ฅ๋˜๋Š” salary_group ์ปฌ๋Ÿผ ์ƒ์„ฑ
customer['salary_group'] = ['VIP' if income > upper_bound else 'else' for income in customer['Annual_Income']]

# Credit_Utilization_Ratio 
sns.boxplot(data=customer, x='salary_group', y='Credit_Utilization_Ratio', palette=green_palette)
plt.title('Credit Utilization Ratio by Salary Group')

# Monthly_Balance
sns.boxplot(data=customer, x='salary_group', y='Monthly_Balance',palette = green_palette)
plt.title('Monthly Balance by Salary Group')

# Amount_invested_monthly 
sns.boxplot(data=customer, x='salary_group', y='Amount_invested_monthly', palette = green_palette)
plt.title('Amount invested monthly by Salary Group')

 

2. VIP ๊ณ ๊ฐ๋“ค์˜ ๋Œ€์ถœํ˜„ํ™ฉ ํŒŒ์•…

# customer ํ…Œ์ด๋ธ”์— annual_income์„ ์ด์šฉํ•ด ์›”๊ธ‰ ๊ณ„์‚ฐ, ์›”๊ธ‰ ์ปฌ๋Ÿผ ์ถ”๊ฐ€
customer['Monthly_Income'] = (customer['Annual_Income'])/4

# ํ•„์š”ํ•œ ์ปฌ๋Ÿผ๋งŒ ์ถ”์ถœํ•ด investment_stat ํ…Œ์ด๋ธ”๋กœ ๋นผ๊ธฐ
lst = ['Customer_ID','age_group', 'Monthly_Income', 'Outstanding_Debt', 'Amount_invested_monthly', 'Monthly_Balance','Num_of_Loan','salary_group', 'Num_of_Delayed_Payment']
investment_stat = customer[lst]

# investment_stat ํ…Œ์ด๋ธ”์— monthly income ๋Œ€๋น„ ์›” ํˆฌ์ž์•ก ๋น„์œจ์„ ๊ณ„์‚ฐํ•ด investment_ratio ์ปฌ๋Ÿผ ์ถ”๊ฐ€
investment_stat['investment_ratio'] = (customer['Amount_invested_monthly']/customer['Monthly_Income'])*100

# VIP ๊ทธ๋ฃน์˜ ๊ฐ’๋งŒ ์ €์žฅํ•œ stat ํ…Œ์ด๋ธ” ๋งŒ๋“ค๊ธฐ
stat = investment_stat[investment_stat['salary_group']=='VIP']

โœ… High Salary Group์˜ ๊ฒฝ์šฐ ๊ทธ ์™ธ ๊ณ ๊ฐ๋“ค๊ณผ ๋น„๊ตํ–ˆ์„ํ–ˆ๋•Œ ๋Œ€์ถœ ๊ฐœ์ˆ˜์˜ ํ‰๊ท ์ด ๋น„์Šทํ•˜๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ „๋ฐ˜์ ์œผ๋กœ high salary group์ด ๋Œ€์ถœ ๊ฐœ์ˆ˜๊ฐ€ ๋‚ฎ๋‹ค.

sns.boxplot(data=customer, x='salary_group', y='Num_of_Loan', palette = green_palette)
plt.title('Number of Loans by Salary Group')

 

โœ… VIP vs. Others

# palette ๋ฐ explode
palette = ['#EEEEEE','#EEEEEE','#EEEEEE','#EEEEEE','#EEEEEE','#EEEEEE','#EEEEEE','#22CC88','#B9FFE3']
explode = [0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01]

# pie chart ๊ทธ๋ฆฌ๊ธฐ - Loan Types of High Salary Group
plt.figure(figsize=(16,9))
plt.pie(loan_grouped['VIP'],
        startangle=20,
        explode=explode,
        colors = palette, autopct='%1.2f%%',)
plt.legend(loan_grouped.index,bbox_to_anchor=(1, 0.5))
plt.title('Loan Types of High Salary Group')

# pie chart ๊ทธ๋ฆฌ๊ธฐ - Loan Types of others
plt.figure(figsize=(16,9))
plt.pie(loan_grouped['else'],
        startangle=20,
        explode=explode,
        colors = palette, autopct='%1.2f%%',)
plt.legend(loan_grouped.index,bbox_to_anchor=(1, 0.5))
plt.title('Loan Types of others')
  • VIP ๊ณ ๊ฐ๋“ค์˜ ๋Œ€์ถœ ์œ ํ˜•์—์„œ ๊ฐ€์žฅ ํฐ ๋ถ€๋ถ„์„ ์ฐจ์ง€ ํ•˜๋Š” ์œ ํ˜•์€ Student Loan(12.10%)์ด๋ฉฐ, ์ •์˜๋˜์ง€ ์•Š์€ ๋Œ€์ถœ ์œ ํ˜•(Not Specified, 9.46%)์ด ๊ฐ€์žฅ ์ ๋‹ค.
  • VIP ๊ณ ๊ฐ์ด ์•„๋‹Œ ๋‹ค๋ฅธ ๊ณ ๊ฐ๋“ค์˜ ๊ฒฝ์šฐ Payday Loan(11.32%)์ด ๊ฐ€์žฅ ๋งŽ๊ณ , Auto Loan(10.78%)๋กœ ๊ฐ€์žฅ ์ ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ 11%๋Œ€๋กœ ๋Œ€์ฒด์ ์œผ๋กœ ๋Œ€์ถœ ์œ ํ˜•์ด ๋น„์Šทํ•˜๋‹ค.

 

โœ… VIP ๊ทธ๋ฃน ๋‚ด์—์„œ age group๋ณ„ ๋Œ€์ถœ ํ˜„ํ™ฉ

vip_age_loantype = customer.groupby(['salary_group','age_group'])[loans].sum().iloc[0:5]
vip_age_loantype = vip_age_loantype.reset_index().drop(['salary_group','age_group'],axis=1).T
vip_age_loantype.columns = [10,20,30,40,50]
vip_age_loantype
  • ๋น„๊ต์  ๋‚˜์ด๊ฐ€ ์–ด๋ฆฐ 10-30๋Œ€์˜ ๊ฒฝ์šฐ Credit-Builder Loan(์‹ ์šฉ ์ ์ˆ˜๋ฅผ ๋†’์ด๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ๋˜๋Š” ํŠน๋ณ„ํ•œ ์œ ํ˜•์˜ ๋Œ€์ถœ)๊ฐ€ ๋งŽ์Œ
  • 40๋Œ€์˜ ๊ฒฝ์šฐ Debt Consolidation Loan, Student Loan ์ด ๊ฐ€์žฅ ๋งŽ์Œ.
  • 50๋Œ€์˜ ๊ฒฝ์šฐ Auto Loan, Debt Consolidation Loan, Student Loan์ด ๊ฐ€์žฅ ๋งŽ์Œ.
  • ์ „์ฒด Age Group์—์„œ ๊ฐ€์žฅ ๋Œ€์ถœ์„ ์ ๊ฒŒ ํ•˜๋Š” ์—ฐ๋ น๋Œ€๋Š” 50๋Œ€, ๊ฐ€์žฅ ํ™œ๋ฐœํžˆ ํ•˜๋Š” ์—ฐ๋ น๋Œ€๋Š” 30๋Œ€์ž„.

โœ…  VIP ๊ณ ๊ฐ๋“ค์˜ ๋ฏธ์ƒํ™˜๋ถ€์ฑ„ ์ด์•ก(Outstanding Debt)๋Š” ์–ด๋– ํ•œ๊ฐ€?

## VIP vs. Others
sns.boxplot(data=investment_stat, x='salary_group', y = 'Outstanding_Debt', palette= green_palette)

## age_group๋ณ„ outstanding_debt
green_palette2 = ['#22CC88','#43DFA0','#78ECBD','#B9FFE3','#EFFBF7']

# 'salary_group'์˜ median ๊ฐ’์„ ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌ๋œ ์ˆœ์„œ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
median_order = stat.groupby('age_group')['Outstanding_Debt'].median().sort_values().index

# median ๊ฐ’์— ๋”ฐ๋ผ ์ •๋ ฌ๋œ ์ˆœ์„œ๋กœ boxplot์„ ๊ทธ๋ฆฝ๋‹ˆ๋‹ค.
sns.boxplot(data=stat, x='age_group', y='Outstanding_Debt', palette=green_palette2, order=median_order)
  • VIP ๊ณ ๊ฐ์ด ํ‰๊ท ์ ์œผ๋กœ ๋ฏธ์ƒํ™˜์ฑ„๋ฌด๊ฐ€ else ๊ทธ๋ฃน๋ณด๋‹ค ์ ์Œ
  • VIP ๊ณ ๊ฐ์ค‘ ๋ฏธ์ƒํ™˜ ์ฑ„๋ฌด๊ฐ€ ๊ฐ€์žฅ ์ ์€ ๊ทธ๋ฃน์€ 30๋Œ€(783.975, 1์œ„), 40๋Œ€(820.020, 2์œ„), 50๋Œ€(849.690,3์œ„)

โœ… VIP ๊ณ ๊ฐ๋“ค์€ ์ฑ„๋ฌด๋ฅผ ์„ฑ์‹คํžˆ ์ดํ–‰ํ•˜๋Š”๊ฐ€?

## VIP vs.Others
sns.boxplot(data=investment_stat, x='salary_group', y = 'Num_of_Delayed_Payment', palette= green_palette)

## age_group
# order ์ •๋ฆฌ
median_order = stat.groupby('age_group')['Num_of_Delayed_Payment'].median().sort_values().index

# median ๊ฐ’์— ๋”ฐ๋ผ ์ •๋ ฌ๋œ ์ˆœ์„œ๋กœ boxplot์„ ๊ทธ๋ฆฝ๋‹ˆ๋‹ค.
sns.boxplot(data=stat, x='age_group', y='Num_of_Delayed_Payment', palette=green_palette2, order=median_order)
  • else ๊ทธ๋ฃน๋ณด๋‹ค๋Š” VIP ๊ณ ๊ฐ์ด ์ฑ„๋ฌด๋ฅผ ์„ฑ์‹คํžˆ ์ดํ–‰ํ•œ๋‹ค.
  • VIP ๊ณ ๊ฐ ์ค‘์—์„œ๋Š” 50๋Œ€ ๊ณ ๊ฐ์ด ์ฑ„๋ฌด๋ฅผ ๊ฐ€์žฅ ์„ฑ์‹คํžˆ ์ดํ–‰ํ•˜๋ฉฐ, 10๋Œ€ ๊ณ ๊ฐ์ด ํ‰๊ท ์ ์œผ๋กœ ์ฒด๋‚ฉ์„ ๊ฐ€์žฅ ๋งŽ์ด ํ•œ๋‹ค.