Intermediate Python - 10. Multi-Scraping Practice: asyncio, beautifulsoup


Purpose

Practice asynchronous I/O with coroutines through a multi-site scraping exercise.

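Blocking I/O: the called function keeps control until its own work is finished. Other functions wait.
Non-blocking I/O: the called function (subroutine) yields and hands control back to the caller (main routine), so other functions can make progress in the meantime.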

asyncio is a library that makes non-blocking I/O easier to implement. One caveat: even with asyncio, if the functions you write (or call) are coded in a blocking style, using asyncio gains nothing.
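To make the caveat concrete, here is a minimal sketch that runs three coroutines twice: once with a blocking time.sleep() and once with await asyncio.sleep(). The names blocking_task, non_blocking_task, and timed are hypothetical and chosen only for this illustration.

```python
import asyncio
import time


async def blocking_task(delay):
    # time.sleep() never yields, so the event loop is stuck for the whole delay.
    time.sleep(delay)


async def non_blocking_task(delay):
    # await hands control back to the event loop while this coroutine waits.
    await asyncio.sleep(delay)


async def timed(task_func, count=3, delay=0.2):
    # Run `count` coroutines together and return the elapsed wall-clock time.
    start = time.perf_counter()
    await asyncio.gather(*(task_func(delay) for _ in range(count)))
    return time.perf_counter() - start


blocking_time = asyncio.run(timed(blocking_task))
non_blocking_time = asyncio.run(timed(non_blocking_task))
print('blocking     :', round(blocking_time, 2))
print('non-blocking :', round(non_blocking_time, 2))
```

The blocking version takes roughly the sum of the delays, while the non-blocking version takes roughly one delay, because the waits overlap.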

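Installation

Create a Python project and install beautifulsoup4. asyncio has been part of the standard library since Python 3.4, so it does not need to be installed with pip.

pip install beautifulsoup4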

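Building the asynchronous structure

loop

Let's look at the overall structure first. At the bottom, where loop is defined, asyncio's get_event_loop() returns the event loop that will drive the asynchronous work. run_until_complete() then runs main() on that loop and blocks until it finishes; it does not set a timeout.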
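As a side note on the loop: run_until_complete() takes no timeout argument and simply waits for the coroutine to finish. The sketch below, which is not part of the course code, shows manual loop management next to asyncio.run(), and how a timeout could be attached with asyncio.wait_for() if one is wanted. The coroutines main, slow, and with_timeout are stand-ins made up for this example.

```python
import asyncio


async def main():
    await asyncio.sleep(0.05)
    return 'finished'


async def slow():
    await asyncio.sleep(1)


# Manual loop management, close to the structure used in this post.
loop = asyncio.new_event_loop()
try:
    manual_result = loop.run_until_complete(main())  # blocks until main() is done
finally:
    loop.close()

# asyncio.run() creates a loop, runs the coroutine to completion, and closes the loop.
run_result = asyncio.run(main())


async def with_timeout():
    # wait_for() cancels slow() and raises TimeoutError after 0.1 seconds.
    try:
        await asyncio.wait_for(slow(), timeout=0.1)
    except asyncio.TimeoutError:
        return 'timed out'
    return 'completed'


timeout_result = asyncio.run(with_timeout())
print(manual_result, run_result, timeout_result)
```

On recent Python versions asyncio.run() is the usual entry point, since calling get_event_loop() when no loop exists is deprecated.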

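urlopen

urlopen fetches the content of the website at the given url. As the comment in the code notes, however, it is a synchronous (blocking) call, so it is wrapped so that it can be awaited with async/await. In fetch, the loop's run_in_executor() submits urlopen to the executor's thread pool with url as its argument, and await suspends the coroutine until the worker thread returns the response. read() is then called on that response. The [0:5] slice keeps only the first 5 bytes, because each page is too long to print in full.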
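One detail worth illustrating: in the fetch function of this post, only urlopen runs in a worker thread, while read() is called afterwards in the coroutine. A possible variant, sketched here under my own assumptions, moves both calls into one helper so that the whole download happens in the worker thread. download_head and fake_download are hypothetical names; fake_download is only a stand-in so the sketch runs without network access.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def download_head(url, size=5):
    # Hypothetical helper: urlopen() and read() both block the worker thread,
    # not the event loop, because the whole function runs in the executor.
    with urlopen(url, timeout=10) as res:
        return res.read()[:size]


def fake_download(url, size=5):
    # Stand-in for download_head: sleeps instead of touching the network
    # and reports which thread executed it.
    time.sleep(0.1)
    return threading.current_thread().name, url.encode()[:size]


async def fetch(url, executor, blocking_func=download_head):
    # get_running_loop() avoids relying on a module-level loop variable.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, blocking_func, url)


async def main(urls, blocking_func):
    with ThreadPoolExecutor(max_workers=10) as executor:
        return await asyncio.gather(
            *(fetch(url, executor, blocking_func) for url in urls)
        )


results = asyncio.run(main(['http://daum.net', 'https://naver.com'], fake_download))
for thread_name, head in results:
    print(thread_name, head)
```

To hit real sites, pass download_head instead of fake_download; the printed thread names then show that the blocking work happens outside MainThread.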

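main

main creates the executor that fetch passes to run_in_executor(), and wraps each fetch() coroutine with ensure_future(), which schedules it and returns a Future (Task) object. gather() then awaits all of those Future objects and returns their results as a list.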
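When several sites are scraped at once, one failing request would normally make gather() raise and hide the other results. The following sketch, an illustration rather than part of the course code, uses gather()'s return_exceptions=True option so that errors come back as values. The fetch stub and the 'bad' URL check are made up for the example.

```python
import asyncio


async def fetch(url):
    # Stub that simulates a request; URLs containing 'bad' are treated as failures.
    await asyncio.sleep(0.01)
    if 'bad' in url:
        raise ValueError(f'cannot fetch {url}')
    return f'ok:{url}'


async def main(urls):
    # ensure_future() schedules each coroutine and returns a Task (a Future subclass).
    futures = [asyncio.ensure_future(fetch(url)) for url in urls]
    # With return_exceptions=True, exceptions are returned in the result list
    # instead of being raised from gather().
    return await asyncio.gather(*futures, return_exceptions=True)


results = asyncio.run(main(['http://example.com', 'http://bad.example']))
for item in results:
    if isinstance(item, Exception):
        print('failed :', item)
    else:
        print('success:', item)
```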

# Async IO
# Asynchronous I/O coroutine work


import asyncio
import timeit
from urllib.request import urlopen
# urlopen is a blocking function, not a non-blocking one.
# To work around this, it is combined with a thread or a process.
from concurrent.futures import ThreadPoolExecutor
import threading

start = timeit.default_timer()

urls = ['http://daum.net', 'https://naver.com', 'http://mlbpark.donga.com', 'https://tistory.com', 'https://wemakeprice.com/']


async def fetch(url, executor):
    print('Thread Name : ', threading.current_thread().name, 'Start', url)
    # loop is available here because it is defined at module level in the __main__ block
    res = await loop.run_in_executor(executor, urlopen, url)
    print('Thread Name : ', threading.current_thread().name, 'Done', url)
    return res.read()[0:5]

async def main():
    executor = ThreadPoolExecutor(max_workers=10)

    futures = [
        asyncio.ensure_future(fetch(url, executor)) for url in urls
    ]

    rst = await asyncio.gather(*futures)

    print('Result : ', rst)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main()) # wait until the work is complete
    duration = timeit.default_timer() - start
    print('Total Running Time: ', duration)

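Output

The requests do not finish in the order in which they were started, and the order changes from run to run. The printed thread name is always MainThread because both print calls run in the coroutine on the event-loop thread; only urlopen runs in a worker thread.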

Thread Name :  MainThread Start http://daum.net
Thread Name :  MainThread Start https://naver.com
Thread Name :  MainThread Start http://mlbpark.donga.com
Thread Name :  MainThread Start https://tistory.com
Thread Name :  MainThread Start https://wemakeprice.com/
Thread Name :  MainThread Done https://tistory.com
Thread Name :  MainThread Done https://naver.com
Thread Name :  MainThread Done http://mlbpark.donga.com
Thread Name :  MainThread Done http://daum.net
Thread Name :  MainThread Done https://wemakeprice.com/
Result :  [b'<!DOC', b'\n<!do', b'<!DOC', b'\n\t<!d', b'<!DOC']
Total Running Time:  0.695709667
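Although the Done lines appear in a varying order, the Result list follows the order of urls, because gather() returns results in the order of the awaitables it was given. A small sketch with made-up delays (the job coroutine and its names are hypothetical) shows the difference between completion order and result order:

```python
import asyncio

finished = []  # records the order in which the jobs actually complete


async def job(name, delay):
    await asyncio.sleep(delay)
    finished.append(name)
    return name


async def main():
    futures = [
        asyncio.ensure_future(job(name, delay))
        for name, delay in [('a', 0.3), ('b', 0.1), ('c', 0.2)]
    ]
    # gather() keeps the input order regardless of which job finishes first.
    return await asyncio.gather(*futures)


results = asyncio.run(main())
print('completion order:', finished)
print('gather order    :', results)
```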

Application: extracting a specific result with beautifulsoup

Change the res.read()[0:5] part so that it returns each site's <title> tag instead.

from bs4 import BeautifulSoup

# ... (omitted)

    # Inside fetch(), in place of `return res.read()[0:5]`
    soup = BeautifulSoup(res.read(), 'html.parser')
    # Inspect the full page source
    # print(soup.prettify())
    return soup.title


# Output (Result part only)
# Result :  [<title>Daum</title>, <title>NAVER</title>, <title>↗ 파크에 오면 즐겁다 MLBPARK</title>, <title>TISTORY</title>, <title>특가프로 위메프로</title>]
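soup.title returns the whole tag object, which is why the result list shows <title>...</title>. If only the text is needed, something like the following sketch may help. extract_title is a hypothetical helper, and the sample HTML is invented so the example runs offline.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    # Return the text of the <title> tag, or None when the page has no usable title.
    soup = BeautifulSoup(html, 'html.parser')
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


sample = b'<!DOCTYPE html><html><head><title> Daum </title></head><body></body></html>'
no_title = b'<html><body>no title here</body></html>'

print(extract_title(sample))
print(extract_title(no_title))
```

The None check matters in a scraper because a page without a <title> would otherwise raise AttributeError when .string is accessed.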

References

1. Inflearn course - 우리를 위한 프로그래밍 : 파이썬 중급 (Programming for Us: Intermediate Python, Inflearn Original)

https://www.inflearn.com/course/%ED%94%84%EB%A1%9C%EA%B7%B8%EB%9E%98%EB%B0%8D-%ED%8C%8C%EC%9D%B4%EC%8D%AC-%EC%A4%91%EA%B8%89-%EC%9D%B8%ED%94%84%EB%9F%B0-%EC%98%A4%EB%A6%AC%EC%A7%80%EB%84%90/dashboard


Computer Explorer Charlie (컴퓨터 탐험가 찰리)
