
[Data Preprocessing] Handling Outliers and Missing Values

ny:D 2024. 5. 31. 19:15

240531 Today I Learn

What are outliers and missing values?

[Figure: EDA process]

💡 Missing values
Data that was not measured, or was lost, during data collection.

💡 Outliers
Values that are extremely small or extremely large relative to the overall range of the data.

 

Libraries used

import pandas as pd 
import numpy as np
import time
from PIL import Image
import warnings

# Suppress warnings
warnings.filterwarnings(action='ignore')

Handling missing values

1. Removing missing values

  • Dropping columns that contain missing values

[Figure: df3 → df3 after dropna]

# Drop the column named 'Unnamed: 4'
df3 = df3.drop('Unnamed: 4', axis=1)

# Confirm that the column was dropped
df3.isnull().sum()

## user id             295
## product id          295
## Interaction type    423
## Time stamp          295

  • Dropping rows that contain missing values

# Drop every row that contains a missing value
df3.dropna(inplace=True)

# Confirm that the missing values are gone
df3.isnull().sum()

## user id             0
## product id          0
## Interaction type    0
## Time stamp          0
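
The two removal steps above can be tried on a small made-up frame. This is only a sketch: `toy` and its values are hypothetical and are not the df3 used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the df3 from this post)
toy = pd.DataFrame({
    "user id": [1, 2, np.nan, 4],
    "Interaction type": ["like", np.nan, "view", "purchase"],
    "Unnamed: 4": [np.nan, np.nan, np.nan, np.nan],
})

# Drop columns in which every value is missing
no_empty_cols = toy.dropna(axis=1, how="all")

# Then drop the rows that still contain any missing value
clean = no_empty_cols.dropna(axis=0, how="any")

print(clean.isnull().sum())
```

Dropping the empty column first matters here: calling `dropna()` on rows while 'Unnamed: 4' is still present would remove every row.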

2. Filling in missing values

  • Categorical variable: fill with the mode

[Figure: df3.info() / df3 / df3.isnull.count()]

  • The 'Interaction type' column of df3 has 423 missing values in total (counted before the rows with missing values were dropped).
  • The column holds categorical data with three categories, 'purchase', 'view', and 'like', so its missing values can be filled with the mode.

 

# To replace the missing values in 'Interaction type' with the mode, first find the mode of the column
df3['Interaction type'].mode()

# Check the mode: missing values are not counted
df3.groupby('Interaction type').count()

→ The mode of the 'Interaction type' column in df3 is 'like'.

# Fill the missing values with the mode using fillna()
# (assign back to the column so that df3 stays a DataFrame with all of its columns)
df3['Interaction type'] = df3['Interaction type'].fillna(df3['Interaction type'].mode().iloc[0])

# Reset the index after the operation
df3 = df3.reset_index(drop=True)

# Check the DataFrame after filling with the mode
df3.groupby('Interaction type').count()

→ All 423 missing values in the 'Interaction type' column of df3 were filled with the mode, 'like'.
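
As a small check of the mode-filling idea, the sketch below uses a hypothetical `toy` frame (not df3) and shows that `mode()` ignores missing values, so the fill value is the most frequent real category.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing values
toy = pd.DataFrame(
    {"Interaction type": ["like", "view", np.nan, "like", np.nan, "purchase"]}
)

# mode() skips NaN by default; take the first mode in case of ties
most_common = toy["Interaction type"].mode().iloc[0]

# Assign back to the column so the frame keeps its shape
toy["Interaction type"] = toy["Interaction type"].fillna(most_common)

print(toy["Interaction type"].value_counts())
```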

  • Numeric variable: fill with the mean / median

# 'Shipping Weight' has 1138 missing values.
df['Shipping Weight'].isnull().sum()

Cleaning the data so that the mean / median can be computed

# Use split to extract the weight without the unit part
df['sw'] = df['Shipping Weight'].str.split().str[0]

# string to float; values that cannot be parsed become NaN
df['sw'] = pd.to_numeric(df['sw'], errors='coerce')

df.loc[df['Shipping Weight'].isna(), ['Shipping Weight', 'sw']]

# The three fills below are alternatives: once one of them has run,
# no NaN is left for the others, so use only one.

# Replace with the mean
# (passing inplace=True would modify the original data instead of returning a copy)
df['sw'] = df['sw'].fillna(df['sw'].mean())

# Replace with the median
df['sw'] = df['sw'].fillna(df['sw'].median())

# Replace using group by
# Compute the median per 'Is Amazon Seller' group and fill with it
df['sw'] = df['sw'].fillna(df.groupby('Is Amazon Seller')['sw'].transform('median'))
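
To see what the group-wise fill does, here is a minimal sketch on made-up numbers. The `toy` frame and the `sw_filled` column are hypothetical. `transform('median')` returns one value per row, aligned with the original index, which is why it can be passed straight to `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two seller groups, one missing weight in each
toy = pd.DataFrame({
    "Is Amazon Seller": ["Y", "Y", "Y", "N", "N", "N"],
    "sw": [1.0, 3.0, np.nan, 10.0, np.nan, 20.0],
})

# Per-row median of the row's own group (NaN is skipped when computing it)
group_median = toy.groupby("Is Amazon Seller")["sw"].transform("median")

# Each missing value is replaced by the median of its group
toy["sw_filled"] = toy["sw"].fillna(group_median)

print(toy)
```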

  • Numeric variable: KNN Imputation

💡 KNN Imputation
Taking all variables into account, it finds the k nearest data points and fills the gap with an average of their values, which can be weighted by distance.
Mainly used when there is no clear idea of how the gap should be filled.

  • KNNImputer cannot be used if the data contains strings.

# The code below converts 'Shipping Weight' to a numeric column.
# NaN is kept on purpose: filling it with 0 first would leave KNNImputer nothing to impute.
df['sw'] = df['Shipping Weight'].str.split().str[0]
df['sw'] = pd.to_numeric(df['sw'], errors='coerce')

# KNN -> the DataFrame must first be converted to numeric form through encoding.
# Only 'sw' is selected here; neighbors are found from the other columns,
# so in practice more numeric columns would normally be included.
knn_df = df[['sw']]

  • Because the method is distance-based, the data must be standardized first. → This keeps every variable's contribution to the distance calculation similar.

[Figure: scale_df]

# Import the library used for standardization
from sklearn.preprocessing import StandardScaler

# Standardize
scale_df = StandardScaler().fit_transform(knn_df)

[Figure: filled_df.isnull().sum() / filled_df]

# Run KNN imputation
import pandas as pd
from sklearn.impute import KNNImputer

# Create the KNNImputer object
imputer = KNNImputer(n_neighbors=3)  # set K (the number of neighbors)

# Replace missing values using KNN (the result is still on the standardized scale)
filled_df = pd.DataFrame(imputer.fit_transform(scale_df), columns=knn_df.columns)
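
A minimal end-to-end sketch of the same pipeline on a hypothetical two-column frame is shown below. The `price` column is an assumption added only so that neighbors can be found, and `inverse_transform` is used to bring the imputed values back to the original units.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 'sw' has one gap, 'price' is complete
toy = pd.DataFrame({
    "sw": [1.0, 2.0, np.nan, 4.0, 5.0],
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# StandardScaler ignores NaN when fitting and keeps it when transforming
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Impute on the standardized data using the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
imputed_scaled = imputer.fit_transform(scaled)

# Undo the scaling so the result is in the original units
filled = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=toy.columns,
    index=toy.index,
)

print(filled)
```

In this toy case the two nearest rows by price have weights 2 and 4, so the gap is filled with their average.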

Handling outliers

1. Outlier detection techniques

💡 Z-Score
A method that detects outliers using the standard deviation of the data, applicable when the data follows a normal distribution.
A Z-score is computed for each data point (row). The Z value is X minus the mean, divided by the standard deviation, and is called the standard score.
The standard score (Z value) shows how far a point lies from the mean; a point whose absolute Z value is 3 or more is treated as an outlier.
  • Z-Score = 0 : the data point equals the mean (its distance from the mean is 0).
  • Z-Score > 0 : the data point is larger than the mean.
  • Z-Score = 1 : the data point is 1 standard deviation above the mean.
  • Z-Score < 0 : the data point is smaller than the mean.
  • Z-Score = -1 : the data point is 1 standard deviation below the mean.
# 1. Convert type: string -> float -> int
df['sw'] = df['Shipping Weight'].str.split().str[0]
df['sw'] = pd.to_numeric(df['sw'] , errors='coerce').fillna(0.0).astype(int)

# 2. Select the column for the z-score and standardize it
# Select the column
df1 = df[['sw']]

# Standardize
scale_df = StandardScaler().fit_transform(df1)

# Attach the standardized scale_df to df1 and rename the columns
# (reuse df1's index so the rows line up even if the index is not 0..n-1)
merge_df = pd.concat([df1, pd.DataFrame(scale_df, index=df1.index)], axis=1)
merge_df.columns = ['Shipping Weight', 'zscore']

[Figure: strange_df / strange_df.count()]

# 3. Detect outliers
# Based on the Z-score: values below -3 or above 3 are judged to be outliers
mask = ((merge_df['zscore']<-3) | (merge_df['zscore']>3))

# Use the boolean mask to extract the outlier rows
strange_df = merge_df[mask]
strange_df

# 55 rows detected in total
strange_df.count()

## Shipping Weight    55
## zscore             55
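
The same Z-score rule can be written directly with pandas, without StandardScaler. The sketch below uses made-up values; `ddof=0` is chosen so the result matches StandardScaler, which divides by the population standard deviation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 60 ordinary weights plus one extreme value
values = pd.Series(np.append(np.tile([4.0, 5.0, 6.0], 20), 50.0), name="sw")

# z = (x - mean) / standard deviation
z = (values - values.mean()) / values.std(ddof=0)

# Flag points more than 3 standard deviations away from the mean
outliers = values[z.abs() > 3]

print(outliers)
```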

 

💡 IQR
IQR : (third quartile) - (first quartile)
Values that fall far outside the range between the 25% point (Q1) and the 75% point (Q3) of the data are treated as outliers.
🚨 An outlier is a value higher than Q3 + 1.5 * IQR or lower than Q1 - 1.5 * IQR.

  • A Box Plot is the graphical form of this rule; data points beyond the 1.5 * IQR fences are drawn as outliers.
  • Used when the data does not follow a normal distribution.

# 1. Convert type: string -> float -> int
df['sw'] = df['Shipping Weight'].str.split().str[0]
df['sw'] = pd.to_numeric(df['sw'] , errors='coerce').fillna(0.0).astype(int)

 

# 2. Select the column for the IQR calculation
# Select the column (copy it so that a new column can be added later without a warning)
df1 = df[['sw']].copy()

# Compute Q3, Q1, and IQR
q3 = df1['sw'].quantile(0.75) 
q1 = df1['sw'].quantile(0.25)
iqr = q3 - q1

q3, q1, iqr
## (7.0, 1.0, 6.0)

→ These values are easy to obtain with the quantile function, which computes percentiles.
→ It can be applied either to a whole DataFrame or to a specific column.

# 3. Filter outliers by the IQR rule
# Outlier: a value higher than Q3 + 1.5 * IQR or lower than Q1 - 1.5 * IQR
def is_outlier(row):
    score = row['sw']
    if score > q3 + (1.5 * iqr) or score < q1 - (1.5 * iqr):
        return 'outlier'
    else:
        return 'not outlier'

# Use apply to decide whether each value is an outlier and store the result in a new column
df1['outlier_flag'] = df1.apply(is_outlier, axis=1)  # axis=1 is required

# The IQR method finds 349 outliers
df1.groupby('outlier_flag').count()
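
For larger data, the row-by-row `apply` can be replaced by a vectorized comparison. The following is a sketch on a hypothetical series, not the Shipping Weight data from this post.

```python
import pandas as pd

# Hypothetical weights with one extreme value
sw = pd.Series([1, 2, 3, 4, 5, 6, 7, 100], name="sw")

q1 = sw.quantile(0.25)
q3 = sw.quantile(0.75)
iqr = q3 - q1

# Fences of the IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask computed for the whole column at once
is_out = (sw < lower) | (sw > upper)

print(sw[is_out])
```

The mask can be stored as a new column or used directly for filtering, which avoids calling a Python function once per row.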

 

💡 Isolation Forest
A machine learning technique that makes it easy to identify outliers when there are many columns.
Each split adds to a point's path length, and whether a point is an outlier is judged by how many splits the tree needs to isolate it.
→ An outlier is a data point with a shorter path length than other observations.

  • The dataset is represented in the form of decision trees.
    → One question leads to the next, so that each value ends up placed in one of: hawk, penguin, dolphin, or bear.
  • The score derived from the path length ranges from 0 to 1, and the closer the result is to 1, the more likely the point is to be treated as an outlier.
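
A minimal sketch with scikit-learn's IsolationForest, on synthetic data, is given below. The data, `random_state`, and `contamination="auto"` are assumptions for illustration. Note that scikit-learn's `score_samples` returns the negative of the 0-to-1 score described above, so the sign is flipped here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 points around the origin plus one far-away point
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), [[8.0, 8.0]]])

model = IsolationForest(n_estimators=100, contamination="auto", random_state=0)

# fit_predict returns -1 for outliers and 1 for inliers
labels = model.fit_predict(X)

# Flip the sign to get a score where values closer to 1 are more anomalous
scores = -model.score_samples(X)

print(labels[-1], round(float(scores[-1]), 3))
```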

💡 DBSCAN
A density-based clustering algorithm; data points that do not belong to any cluster are detected as outliers.
Core points connect to one another to form clusters, and points not connected to them are classified as outliers.

  • Useful for identifying outliers in data with a complex structure
  • Mainly used as an outlier detection technique in geographic data analysis and image data analysis.
  • Clusters are formed based on the density around each data point: if at least a set minimum number of other points lie within a set distance of a point, that point is considered a core point.
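
The idea can be sketched with scikit-learn's DBSCAN on a few hand-made points. The `eps` and `min_samples` values are assumptions chosen for this toy data. In scikit-learn, `min_samples` counts the point itself, and noise points receive the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of four points and one isolated point
X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
    [10.0, 0.0],
])

# eps: neighborhood radius, min_samples: points needed to be a core point
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = db.labels_

# Points labeled -1 belong to no cluster and are treated as outliers
outliers = X[labels == -1]

print(labels)
print(outliers)
```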

2. Handling outliers

📌 Removing outliers

  • Remove an outlier when it is a data error or an inappropriate value
  • However, removal shrinks the sample size, so it should be done with care

 

📌 Replacing outliers

  • Log transformation: applying a log transformation to the data can soften extreme values.
  • Upper and lower bounds: decide a lower and an upper bound, then replace values below the lower bound with the lower bound and values above the upper bound with the upper bound
  • Mean absolute deviation: replace values that lie more than n deviations away from the median.
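
The three replacement approaches above could look roughly like this. It is a sketch under several assumptions: the data is made up, the bounds are taken from the IQR fences, n = 3, and flagged values are replaced with the median (the post does not say what the replacement value should be).

```python
import numpy as np
import pandas as pd

# Hypothetical data with one extreme value
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# 1) Log transformation: log1p compresses large values (and handles 0)
log_s = np.log1p(s)

# 2) Upper and lower bounds: here the IQR fences are used as the bounds
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
capped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# 3) Mean absolute deviation around the median
median = s.median()
mad = (s - median).abs().mean()
n = 3  # assumed threshold
far = (s - median).abs() > n * mad
replaced = s.where(~far, median)  # keep values that are not far, else use the median

print(capped.tolist())
print(replaced.tolist())
```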

📌 Separating outliers

  • Outliers can be split off into a separate group and analyzed on their own, which is useful when they carry important information about the data
    → Create a new DataFrame and store the outliers in it
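
A short sketch of the separation step follows. The `toy` frame and its precomputed `is_outlier` flag are hypothetical; in practice the flag would come from one of the detection methods above.

```python
import pandas as pd

# Hypothetical data with an outlier flag that was computed earlier
toy = pd.DataFrame({
    "sw": [1, 2, 3, 4, 100],
    "is_outlier": [False, False, False, False, True],
})

# Store the outliers in their own DataFrame for separate analysis
outlier_df = toy[toy["is_outlier"]].drop(columns="is_outlier").reset_index(drop=True)

# Keep the remaining rows for the main analysis
normal_df = toy[~toy["is_outlier"]].drop(columns="is_outlier").reset_index(drop=True)

print(outlier_df)
print(normal_df)
```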