๐Ÿ“’ Today I Learn/๐Ÿ Python

[Pandas] ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ (3) ๋ฐ์ดํ„ฐ ๋ณ‘ํ•ฉ, ๋ฐ์ดํ„ฐ ์ง‘๊ณ„, ๋ฐ์ดํ„ฐ ์ •๋ ฌ

ny:D 2024. 5. 9. 22:04

240509 Today I Learn

๋ฐ์ดํ„ฐ ๋ณ‘ํ•ฉํ•˜๊ธฐ

๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ๋ถ™์ด๊ธฐ Concat

๐Ÿ’ก Concat([๋ถ™์ผ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„], axis = ์ƒํ•˜(0)/์ขŒ์šฐ(1))
pd.concat([df1, df2], axis = 0, ignore_index = False)
  • axis์˜ ๊ธฐ๋ณธ๊ฐ’์€ 0(์œ„์•„๋ž˜๋กœ ๋ถ™์ด๊ธฐ)
  • ignore_index์˜ ๊ธฐ๋ณธ๊ฐ’์€ False(์›๋ž˜ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์—์„œ์˜ ์ธ๋ฑ์Šค ์œ ์ง€)
    → `ignore_index = True`๋‚˜ `.reset_index(drop=True)`๋ฅผ ํ™œ์šฉํ•ด ์ƒˆ๋กœ์šด ์ธ๋ฑ์Šค๋ฅผ ๋ถ€์—ฌํ•  ์ˆ˜ ์žˆ์Œ
  • axis = 0 / axis = 1

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'], 'B': ['B3', 'B4', 'B5']})

vconcat = pd.concat([df1,df2], axis = 0)
hconcat = pd.concat([df1,df2], axis = 1)
  • ignore_index = True/ False

idx = pd.concat([df1,df2], ignore_index = False)
newidx = pd.concat([df1,df2], ignore_index = True)
newidx2 = pd.concat([df1,df2]).reset_index(drop=True)

๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ๋ณ‘ํ•ฉํ•˜๊ธฐ Merge

๐Ÿ’กMerge(์™ผ์ชฝ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„, ์˜ค๋ฅธ์ชฝ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„, on ='๊ณตํ†ต๋œ ์ปฌ๋Ÿผ', how=์กฐ์ธ ๋ฐฉ์‹) 
pd.merge(left_df, right_df, on='key', how='inner)
  • ๊ณตํ†ต๋œ ์ปฌ๋Ÿผ๊ฐ’(SQL์—์„œ join๊ณผ on์„ ์‚ฌ์šฉํ•˜๋“ฏ์ด ์‚ฌ์šฉํ•˜๋ฉด ๋จ)
  • how์˜ ๊ธฐ๋ณธ ๊ฐ’์€ inner๋กœ merge๋ฅผ ํ†ตํ•œ ๊ฒฐํ•ฉ์€ inner join์ด ๊ธฐ๋ณธ๊ฐ’์ด๋‹ค.
    → SQL๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ 'left', 'outer' ์˜ต์…˜์„ ์ž…๋ ฅํ•ด left join , outer join์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

left_df = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
right_df = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})

inner_join = pd.merge(left_df, right_df, on='key', how='inner')
outer_join = pd.merge(left_df, right_df, on='key', how='outer')
left_join = pd.merge(left_df, right_df, on='key', how='left')

๋ฐ์ดํ„ฐ ์ง‘๊ณ„ํ•˜๊ธฐ

Group by ์™€ ์ง‘๊ณ„ํ•จ์ˆ˜

 

[Pandas] ์œ ์šฉํ•œ ํŒ๋‹ค์Šค ํ•จ์ˆ˜ (.loc / .groupby/ .idxmax)

240430 Today I LearnPandas '.loc[]' ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—์„œ ํ–‰ ์กฐํšŒํ•˜๊ธฐ๐Ÿ’ก dataframe.loc[ ํ–‰ ์ธ๋ฑ์‹ฑ ๊ฐ’, ์—ด ์ธ๋ฑ์‹ฑ ๊ฐ’]๋ผ๋ฒจ๋กœ ์ธ๋ฑ์‹ฑ, ๋๋ฒˆํ˜ธ ํฌํ•จ์Šฌ๋ผ์ด์‹ฑ ๊ฐœ๋… ์ ์šฉ ๊ฐ€๋Šฅโ–ถ๏ธŽ ํ™œ์šฉ ์˜ˆ์‹œ ) ๊ฐ€์žฅ ๋งŽ์€ ๊ฐ€์กฑ

archivenyc.tistory.com

  • ์˜ˆ์‹œ 1) ๋ณต์ˆ˜์˜ ์—ด์„ ๊ธฐ์ค€์œผ๋กœ Group by ํ•จ์ˆ˜ ์‚ฌ์šฉํ•˜๊ธฐ
    →  ์„ฑ๋ณ„๊ณผ ์ƒ์กด์—ฌ๋ถ€๋ณ„ ๋‚˜์ด์˜ ํ‰๊ท ์„ ๊ตฌํ•˜๋ผ.
df.groupby(['Sex','Survived'])['Age'].max()

## Sex     Survived
## female  0           57.0
##         1           63.0
## male    0           74.0
##         1           80.0
## Name: Age, dtype: float64
  • ์˜ˆ์‹œ 2) ๋ณต์ˆ˜์˜ ์—ด์— ๊ฐ๊ฐ ๋‹ค๋ฅธ ์ง‘๊ณ„ ํ•จ์ˆ˜ ์ ์šฉํ•˜๊ธฐ
    → ์„ฑ๋ณ„๊ณผ ์ƒ์กด์—ฌ๋ถ€ ๋ณ„ ๋‚˜์ด์˜ ํ‰๊ท (mean)๊ณผ ์ตœ๋Œ“๊ฐ’(max), ์šด์ž„์˜ ์ตœ๋Œ“๊ฐ’์„ ๊ตฌํ•˜๋ผ.

df.groupby(['Sex','Survived']).agg({'Age':['mean','max'],'Fare': 'max'})

ํ”ผ๋ฒ—ํ…Œ์ด๋ธ” ๋งŒ๋“ค๊ธฐ

๐Ÿ’ก df.pivot_table(index = 'row_column', columns = 'col_columns', values = '๊ฐ’')
  • index : ํ–‰์— ๋“ค์–ด๊ฐˆ ์ปฌ๋Ÿผ
  • columns : ์—ด์— ๋“ค์–ด๊ฐˆ ์ปฌ๋Ÿผ
  • values : ๊ฐ’์„ ๊ณ„์‚ฐํ•  ์ปฌ๋Ÿผ
  • margins : ์ด๊ณ„
    → ๊ธฐ๋ณธ๊ฐ’ `margins = False`
  • ๊ธฐํƒ€ ์˜ต์…˜
    • fill_value : ๊ฒฐ์ธก์น˜๊ฐ€ ๋‚˜์˜ค๋Š” ๊ฒฝ์šฐ ๋Œ€์ฒดํ•  ๊ฐ’์„ ์ž…๋ ฅ
    • margins_name : ์ด๊ณ„๊ฐ’์ด ์ถœ๋ ฅ๋  ์ปฌ๋Ÿผ๋ช…

 

shop = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-01'],
    'cat': ['shoes', 'acc', 'clothes', 'acc', 'shoes'],
    'price': [100,20,300,50,80]
})

shop1 = shop.pivot_table(index = 'Date', columns = 'cat', values = 'price', aggfunc = 'sum')
shop2 = shop.pivot_table(index = 'Date', columns = 'cat', values = 'price',\
                 aggfunc = 'sum', fill_value= 0)
shop3 = shop.pivot_table(index = 'Date', columns = 'cat', values = 'price',\
                 aggfunc = 'sum', fill_value= 0,\
                 margins = True, margins_name= 'tot')

๋ฐ์ดํ„ฐ ์ •๋ ฌํ•˜๊ธฐ

๋”๋ณด๊ธฐ
newjeans = pd.DataFrame({
    'name' : ['Minji','Hanni','Danielle','Haerin', 'Hyein'],
    'age' : [20, 19, 19, 17, 16],
    'nationality' : ['Korea', 'Australia', 'Australia','Korea','Korea'],
    'height': [169,161.7, 165, 165, 170]
})

์ปฌ๋Ÿผ ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌํ•˜๊ธฐ : sort_values() 

๐Ÿ’ก sort_values(๊ธฐ์ค€์ปฌ๋Ÿผ, ascending = True)
  • ์˜ˆ์‹œ 1) ํ‚ค(Height)๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์˜ค๋ฆ„์ฐจ์ˆœ, ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ
  •  

asc_newjeans = newjeans.sort_values('height')
desc_newjeans = newjeans.sort_values('height',ascending = False)
  • ์˜ˆ์‹œ 2) ํ‚ค(Height)์™€ ๋‚˜์ด(age)๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ

newjeans.sort_values(['height','age'], ascending = False)
  • ์˜ˆ์‹œ 3) ํ‚ค(Height)๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ, ๊ฐ™์€ ๊ฐ’์€ ๋‚˜์ด(age)๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์˜ค๋ฆ„์ฐจ์ˆœ ์ •๋ ฌ

newjeans.sort_values(['height','age'], ascending = [False, True])

์ธ๋ฑ์Šค ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌํ•˜๊ธฐ : sort_index()

๐Ÿ’ก sort_index(๊ธฐ์ค€์ปฌ๋Ÿผ, ascending = True)

newjeans.sort_index(ascending=False)

→ ์ธ๋ฑ์Šค๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ