
Data Collection Methods Special Lecture #2: Crawling

ny:D 2024. 8. 27. 18:33

240717 Today I Learned

Crawling and Scraping

Crawling vs. Scraping

  • Web crawling: a method of collecting data by visiting web pages across the web at large. A crawler moves from page to page by following the links on each page, collecting data automatically as it goes.
  • Web scraping: automatically extracting the data you need from a specific website or page.
Web crawling vs. scraping at a glance

  • In common: both can collect the data you want. In practice they are often used together in Python (web access through crawling → extraction of specific data through scraping).
  • Deduplication
    → Web crawling: required, because a crawler does not recognize that the same content has been uploaded to multiple pages.
    → Scraping: not strictly necessary, because only specific data is extracted.
  • Differences
    → Web crawling: used to index and store information about websites; performed by search engines and other automated tools.
    → Scraping: extracts data from websites for analysis and other purposes; performed by people or by automated tools designed specifically for this purpose.
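
To make the distinction concrete, here is a minimal sketch that runs without any network access. Everything in it is an assumption made for illustration: `FAKE_WEB` is a hypothetical in-memory stand-in for a set of linked pages, and `crawl` and `scrape_price` are made-up helper names. The crawler follows links and skips both already-visited URLs and duplicate content, while the scraper pulls one specific value out of one page.

```python
from collections import deque
import hashlib

# Hypothetical in-memory "web": URL -> (page text, outgoing links)
FAKE_WEB = {
    "/home": ("Welcome", ["/news", "/about"]),
    "/news": ("Price: 100", ["/home", "/news-copy"]),
    "/news-copy": ("Price: 100", ["/home"]),  # same content under another URL
    "/about": ("About us", []),
}

def crawl(start):
    """Breadth-first crawl: follow links, skip visited URLs and duplicate content."""
    visited, seen_hashes, pages = set(), set(), []
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in visited or url not in FAKE_WEB:
            continue
        visited.add(url)
        text, links = FAKE_WEB[url]
        # Deduplicate by content hash, since the URL alone cannot reveal copies
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            pages.append((url, text))
        queue.extend(links)
    return pages

def scrape_price(text):
    """Scraping: extract one specific field from a single page's text."""
    prefix = "Price: "
    return int(text[len(prefix):]) if text.startswith(prefix) else None

pages = crawl("/home")
print([url for url, _ in pages])
print(scrape_price(FAKE_WEB["/news"][0]))
```

The crawl keeps `/news` but drops `/news-copy`, which is the deduplication step a crawler needs and a targeted scraper usually does not.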

⚠️ Caution: check whether it is legal

  • You need to confirm that you are complying with the Robots Exclusion Standard.
    → Check by appending '/robots.txt' to the end of the site's address
  • Before scraping or crawling, always read the target website's robots.txt file and confirm that what you plan to do complies with it.
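
The check above can also be done in code. This is a minimal sketch using Python's standard-library `urllib.robotparser`; the rules in `SAMPLE_ROBOTS_TXT`, the `my-crawler` user agent, and the `example.com` URLs are all made up for illustration, so the example runs offline. Against a real site you would point the parser at the site's `/robots.txt` with `set_url()` and `read()` instead of passing text to `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example needs no network
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def build_parser(robots_txt):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a given user agent may fetch a given URL
print(parser.can_fetch("my-crawler", "https://example.com/private/data.html"))
print(parser.can_fetch("my-crawler", "https://example.com/posts/1"))
```

Passing this check covers the robots.txt rules only; it does not by itself answer the question of whether the collection is legal.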

The 3 building blocks of a website

  • HTML (HyperText Markup Language): the skeleton of the website
  • CSS (Cascading Style Sheets): styles the page so it looks good
  • JavaScript: makes the page interactive

HTML

🍎 macOS shortcut for Chrome DevTools (the equivalent of F12)
command (⌘) + option (⌥) + I
💡 HTML (HyperText Markup Language)
A markup language (that is, a language whose content is wrapped in tags, <>) for expressing information that describes a document.
Besides the document's content, it also carries things like the document's structure and formatting.

  • Document type declaration (Doctype): tells the web browser which version of HTML or XHTML the document is written in.
  • Root element (html): the top-level element of an HTML document; the parent element that contains all other HTML elements.
  • Head: the part that holds the document's metadata and information about external resources.
    → Not shown directly on screen, but it plays an important role in how the web browser processes and displays the document.
  • Body: the part that holds the document's actual content.
    → Contains all the content that makes up the web page: text, images, links, tables, forms, and so on.
    • <div> : a tag that creates one block (a full-width horizontal region) within an HTML document
      → <div class >: creates one block and labels it with a class
    • <p> : a tag used mainly for sentences; it also creates one block
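
To see those parts in one place, here is a small sketch that feeds a made-up HTML document to Python's standard-library `html.parser` and records the doctype and every opening tag in the order it appears. `SAMPLE_HTML` and the `TagCollector` class are illustrative names, not part of the lecture material.

```python
from html.parser import HTMLParser

# Hypothetical minimal page containing the parts listed above
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head>
    <title>Sample page</title>
  </head>
  <body>
    <div class="post">
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </body>
</html>"""

class TagCollector(HTMLParser):
    """Records the doctype declaration and each opening tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.tags = []

    def handle_decl(self, decl):
        # Called for declarations such as <!DOCTYPE html>
        self.doctype = decl

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))

collector = TagCollector()
collector.feed(SAMPLE_HTML)

print(collector.doctype)
for tag, attrs in collector.tags:
    print(tag, attrs)
```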

 

Beautiful Soup

  • Crawls based on HTML tags.
  • The requests library fetches the web page, and the BeautifulSoup library extracts data from the fetched page.
  • Parts of a page that change with scrolling, navigation, button clicks, and the like cannot be crawled this way → the Selenium library fills that gap.
💡 Parsing
The process of analyzing a sequence of strings, such as natural language or computer language found on the web.

Practice

# Install once, the first time
# pip install requests beautifulsoup4

# Import the libraries
import requests
from bs4 import BeautifulSoup as bs

# Make the HTTP request with the requests library
page = requests.get("https://library.gabia.com/")

# Parse (analyze as a string) the page's text with the BeautifulSoup library
soup = bs(page.text, "html.parser")

# Use select() with a CSS selector to pick out every matching part of the HTML.
# Starting from div.esg-entry-content, go down to any <a> inside it, then to a <span> directly under that <a>.
elements = soup.select('div.esg-entry-content a > span')

# Loop over the results and print each element's text.
for index, element in enumerate(elements, 1):
    print("Title of post #{}: {}".format(index, element.text))
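
The selector in the example above mixes two kinds of relationship: a space means "anywhere inside", and `>` means "directly under". Here is a small offline sketch of that behavior, assuming `beautifulsoup4` is installed. `SAMPLE_HTML` is invented for illustration and only borrows the class name from the example; it is not the real structure of the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three links inside the target div, one outside it
SAMPLE_HTML = """
<div class="esg-entry-content">
  <a href="/post/1"><span>First post</span></a>
  <a href="/post/2"><em><span>Wrapped in em, so not directly under a</span></em></a>
  <p><a href="/post/3"><span>Third post</span></a></p>
</div>
<div class="sidebar">
  <a href="/ad"><span>Outside the target div</span></a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "div.esg-entry-content a": any <a> nested at any depth inside the div
# "a > span": only a <span> that is a direct child of that <a>
elements = soup.select("div.esg-entry-content a > span")
titles = [element.text for element in elements]

for index, title in enumerate(titles, 1):
    print("Title of post #{}: {}".format(index, title))
```

The second link is skipped because its `<span>` sits inside an `<em>`, and the sidebar link is skipped because it is outside the target `<div>`.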

Selenium

  • Not a library built purely for crawling; it is a module created so web developers can test whether dynamic web pages work correctly.
    → Makes dynamic crawling possible
  • A framework that automatically opens and controls a browser such as Chrome

Practice

Let's use Selenium to crawl and scrape the ratings for '범죄도시2' (The Roundup) posted on CGV.

 

Import the libraries

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
import time
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

 

Function

def get_movie_reviews(url, page_num=12):
    # WebDriver setup (the Service-object variant is kept commented out below)

    #service = ChromeService(executable_path=ChromeDriverManager().install())
    #wd = webdriver.Chrome(service=service)
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    wd = webdriver.Chrome(options=chrome_options)

    wd.get(url)
    writer_list = []
    review_list = []
    date_list = []
    like_list = []

    for page_no in range(1, page_num+1):
        try:
            page_ul = wd.find_element(By.ID, 'paging_point')
            page_a = page_ul.find_element(By.LINK_TEXT, str(page_no))
            page_a.click()
            time.sleep(2)

            writers = wd.find_elements(By.CLASS_NAME, 'writer-name')
            writer_list += [writer.text for writer in writers]

            reviews = wd.find_elements(By.CLASS_NAME, 'box-comment')
            review_list += [review.text for review in reviews]

            dates = wd.find_elements(By.CLASS_NAME, 'day')
            date_list += [date.text for date in dates]

            likes = wd.find_elements(By.ID, 'idLikeValue')
            like_list += [like.text for like in likes]

            if page_no % 10 == 0:
                next_button = page_ul.find_element(By.CLASS_NAME, "paging-side")
                next_button.click()
                time.sleep(2)

        except NoSuchElementException:
            break

    movie_review_df = pd.DataFrame({
        "Writer": writer_list,
        "Review": review_list,
        "Date": date_list,
        "Like": like_list
    })
    wd.quit()  # end the whole browser session, not just the current window
    return movie_review_df

# Usage example
movie_review_df = get_movie_reviews("http://www.cgv.co.kr/movies/detail-view/?midx=85813", page_num=12)
movie_review_df.to_csv('범죄도시2크롤링.csv', index=False, encoding="utf-8-sig")

# Display the result (notebook-style cell output)
movie_review_df
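
One thing to watch with the function above: it fills four separate lists, and `pd.DataFrame` raises a `ValueError` if those lists end up with different lengths, for example when some reviews on a page lack one of the fields. A minimal pure-Python sketch of a pre-check follows. `check_aligned` is a hypothetical helper name, and the sample values are made up rather than scraped.

```python
def check_aligned(columns):
    """Return each column's length, raising if the lengths are not all equal."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return lengths

# Made-up values standing in for writer_list, review_list, and so on
good = {"Writer": ["a", "b"], "Review": ["nice", "fun"], "Like": ["3", "0"]}
print(check_aligned(good))

bad = {"Writer": ["a", "b"], "Review": ["nice", "fun"], "Like": ["3"]}
try:
    check_aligned(bad)
except ValueError as error:
    print(error)
```

Running this check just before building the DataFrame shows which field came up short, which is easier to debug than the generic pandas error.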