Big Data
[Crawling] Collecting large-scale data by search keyword #2
Unknown user
2019. 4. 14. 20:04
# Requires Python 3.7 with Anaconda already installed
1. Install the beautifulsoup4 module (Anaconda Prompt)
- pip install beautifulsoup4 : installs beautifulsoup4
- pip list : lists the installed libraries; check that beautifulsoup4 appears
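The installation can also be checked from Python itself. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration. Note that the package is installed as `beautifulsoup4` but imported as `bs4`.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return find_spec(module_name) is not None


# beautifulsoup4 is imported under the name "bs4"
print("bs4 installed:", is_installed("bs4"))
```

If this prints `False`, the package was probably installed into a different environment than the one running the script.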
2. Scraping practice (source: https://wayhome25.github.io/python/2017/04/25/cs-27-crawling/)
1) Site to scrape: https://www.rottentomatoes.com
- Fetch the list of movie titles and links
from urllib.request import urlopen
from bs4 import BeautifulSoup
base_url = "https://www.rottentomatoes.com"
html = urlopen(base_url + "/")
source = html.read()  # raw bytes of the page source
html.close()  # close the connection opened by urlopen
# Pass the document to the BeautifulSoup constructor to create a document object.
# The "html5lib" parser is a separate package (pip install html5lib).
soup = BeautifulSoup(source, "html5lib")
table = soup.find(id="Top-Box-Office")  # look up the table by its id
movies = table.find_all(class_="middle_col")  # cells that hold the title link
for movie in movies:
    title = movie.get_text(strip=True)  # drop the whitespace around the title
    print(title, end=' ')
    link = movie.a.get('href')  # relative path such as /m/shazam
    print(base_url + link)
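Because the live page can change or be unreachable, the same extraction can be tried on a fixed HTML string. The sketch below is an illustration under assumptions, not the original author's code: `SAMPLE_HTML` is a hand-written sample shaped like the table in this post, `extract_movies` is a hypothetical helper name, and the built-in `html.parser` is used so that no extra parser is required. `urljoin` builds the absolute link instead of string concatenation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample markup, shaped like the Top-Box-Office table
SAMPLE_HTML = """
<table class="movie_list" id="Top-Box-Office">
  <tr><td class="middle_col"><a href="/m/shazam">Shazam!</a></td></tr>
  <tr><td class="middle_col"><a href="/m/dumbo_2019">Dumbo</a></td></tr>
</table>
"""

BASE_URL = "https://www.rottentomatoes.com/"


def extract_movies(html, base_url=BASE_URL):
    """Return (title, absolute_url) pairs from the box-office table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id="Top-Box-Office")
    if table is None:  # the page layout may have changed
        return []
    results = []
    for cell in table.find_all(class_="middle_col"):
        anchor = cell.find("a")
        if anchor is None or not anchor.get("href"):
            continue  # skip cells without a usable link
        title = anchor.get_text(strip=True)
        results.append((title, urljoin(base_url, anchor["href"])))
    return results


movies = extract_movies(SAMPLE_HTML)
for title, link in movies:
    print(title, link)
```

Returning an empty list when the table is missing avoids the `AttributeError` that `table.find_all` would raise on `None`.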
# Output
# Inspecting the document object
soup
<!DOCTYPE html>
<html dir="ltr" lang="en" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/"><head prefix="og: http://ogp.me/ns# flixstertomatoes: http://ogp.me/ns/apps/flixstertomatoes#">
<!-- salt=lay-def-02-juRm -->
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
....
# Finding an object by ID
table
<table class="movie_list" id="Top-Box-Office">
<tbody><tr class="sidebarInTheaterOpening">
<td class="left_col">
<a href="/m/shazam">
<span class="icon tiny certified_fresh"></span>
<span class="tMeterScore">90%</span>
</a>
</td>
<td class="middle_col">
<a href="/m/shazam">Shazam!</a>
</td>
<td class="right_col right">
<a href="/m/shazam">$53.5M</a>
</td>
</tr>
<tr class="sidebarInTheaterOpening">
<td class="left_col">
<a href="/m/pet_sematary_2019">
<span class="icon tiny rotten"></span>
<span class="tMeterScore">58%</span>
</a>
</td>
<td class="middle_col">
<a href="/m/pet_sematary_2019">Pet Sematary</a>
</td>
<td class="right_col right">
<a href="/m/pet_sematary_2019">$24.6M</a>
</td>
</tr>
....
# Finding objects by class name
movies
<td class="middle_col">
<a href="/m/shazam">Shazam!</a>
</td>, <td class="middle_col">
<a href="/m/pet_sematary_2019">Pet Sematary</a>
</td>, <td class="middle_col">
<a href="/m/dumbo_2019">Dumbo</a>
</td>, <td class="middle_col">
<a href="/m/us_2019">Us</a>
</td>, <td class="middle_col">
<a href="/m/captain_marvel">Captain Marvel</a>
</td>, <td class="middle_col">
<a href="/m/the_best_of_enemies_2019">The Best of Enemies</a>
</td>, <td class="middle_col">
<a href="/m/five_feet_apart">Five Feet Apart</a>
</td>, <td class="middle_col">
<a href="/m/unplanned">Unplanned</a>
</td>, <td class="middle_col">
<a href="/m/wonder_park">Wonder Park</a>
</td>, <td class="middle_col">
<a href="/m/how_to_train_your_dragon_the_hidden_world">How to Train Your Dragon: The Hidden World</a>
</td>
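The table output shown earlier also carries a score (`left_col`) and a gross figure (`right_col`) in each row, so all three values can be collected per row. This is a rough sketch, assuming the row layout shown in this post; `SAMPLE_ROW_HTML` is one row copied from that output and `parse_rows` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# One row copied from the table output in this post
SAMPLE_ROW_HTML = """
<table class="movie_list" id="Top-Box-Office"><tbody>
<tr class="sidebarInTheaterOpening">
<td class="left_col"><a href="/m/shazam">
<span class="icon tiny certified_fresh"></span>
<span class="tMeterScore">90%</span></a></td>
<td class="middle_col"><a href="/m/shazam">Shazam!</a></td>
<td class="right_col right"><a href="/m/shazam">$53.5M</a></td>
</tr></tbody></table>
"""


def parse_rows(html):
    """Return one dict per table row with title, score, and gross."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id="Top-Box-Office")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        score = tr.find(class_="tMeterScore")
        title = tr.find(class_="middle_col")
        gross = tr.find(class_="right_col")
        if score is None or title is None or gross is None:
            continue  # skip rows that do not match the expected layout
        rows.append({
            "title": title.get_text(strip=True),
            "score": score.get_text(strip=True),
            "gross": gross.get_text(strip=True),
        })
    return rows


print(parse_rows(SAMPLE_ROW_HTML))
```

`class_="right_col"` matches the cell even though its class attribute is `right_col right`, because Beautiful Soup compares against each class individually.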
3. Building the database (MariaDB)
- Written with reference to the ERD from post #1
- The sizes of the columns in the primary keys caused error 1071, Specified key was too long; max key length is ~
→ Reduced the size of most columns
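The "Specified key was too long" check works on bytes, not characters: each key column counts as its declared length multiplied by the maximum bytes per character of its character set. The sketch below estimates that size. It is a rough estimate under assumptions: 4 bytes per character (utf8mb4) and a limit of 3072 bytes are assumed values, the real limit depends on the storage engine, row format, and server version, and `key_bytes` is a hypothetical helper name.

```python
def key_bytes(char_lengths, bytes_per_char):
    """Worst-case byte size of each key column and of the whole key."""
    per_column = [n * bytes_per_char for n in char_lengths]
    return per_column, sum(per_column)


# Three VARCHAR(200) columns, as in the composite primary keys of this post
per_column, total = key_bytes([200, 200, 200], bytes_per_char=4)  # utf8mb4 assumed
print(per_column, total)  # [800, 800, 800] 2400

# Assumed limit; check the actual value reported by your server
ASSUMED_LIMIT = 3072
print("fits:", total <= ASSUMED_LIMIT)  # fits: True
```

Shrinking the columns, as done here, is one way to get under the limit; the other is to keep long text out of the key and use a shorter surrogate key.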
CREATE TABLE `Master` (
`SEARCH_WORD` VARCHAR(200) NOT NULL,
`SEARCH_RESULT_COUNT` INT NULL
);
CREATE TABLE `SUB_MASTER_WAREHOUSE` (
`SEARCH_TITLE` VARCHAR(200) NOT NULL,
`ADDRESS` VARCHAR(200) NOT NULL,
`SEARCH_WORD` VARCHAR(200) NOT NULL,
`PAGE_NUMBER` INT NULL,
`CONTENTS_NUMBER` INT NULL
);
CREATE TABLE `SUB_MASTER_DATAMART` (
`SEARCH_TITLE` VARCHAR(200) NOT NULL,
`ADDRESS` VARCHAR(200) NOT NULL,
`SEARCH_WORD` VARCHAR(200) NOT NULL,
`PAGE_NUMBER` INT NULL,
`CONTENTS_NUMBER` INT NULL
);
CREATE TABLE `SUB_CONTENTS` (
`SEARCH_TITLE` VARCHAR(200) NOT NULL,
`ADDRESS` VARCHAR(200) NOT NULL,
`SEARCH_WORD` VARCHAR(200) NOT NULL,
`CONTENT01` VARCHAR(2000),
`CONTENT02` VARCHAR(2000)
);
ALTER TABLE `Master` ADD CONSTRAINT `PK_MASTER` PRIMARY KEY (
`SEARCH_WORD`
);
ALTER TABLE `SUB_MASTER_WAREHOUSE` ADD CONSTRAINT `PK_SMW` PRIMARY KEY (
`SEARCH_TITLE`,
`ADDRESS`,
`SEARCH_WORD`
);
ALTER TABLE `SUB_MASTER_DATAMART` ADD CONSTRAINT `PK_SUB_MASTER_DATAMART` PRIMARY KEY (
`SEARCH_TITLE`,
`ADDRESS`,
`SEARCH_WORD`
);
ALTER TABLE `SUB_CONTENTS` ADD CONSTRAINT `PK_SUB_CONTENTS` PRIMARY KEY (
`SEARCH_TITLE`,
`ADDRESS`,
`SEARCH_WORD`
);
ALTER TABLE `SUB_MASTER_WAREHOUSE` ADD CONSTRAINT `FK_Master_TO_SUB_MASTER_WAREHOUSE_1` FOREIGN KEY (
`SEARCH_WORD`
)
REFERENCES `Master` (
`SEARCH_WORD`
);
ALTER TABLE `SUB_MASTER_DATAMART` ADD CONSTRAINT `FK_Master_TO_SUB_MASTER_DATAMART_1` FOREIGN KEY (
`SEARCH_WORD`
)
REFERENCES `Master` (
`SEARCH_WORD`
);
-- SUB_CONTENTS references the full composite primary key of SUB_MASTER_DATAMART in one constraint,
-- because a foreign key must reference columns that lead an index in the parent table.
ALTER TABLE `SUB_CONTENTS` ADD CONSTRAINT `FK_SUB_MASTER_DATAMART_TO_SUB_CONTENTS_1` FOREIGN KEY (
`SEARCH_TITLE`,
`ADDRESS`,
`SEARCH_WORD`
)
REFERENCES `SUB_MASTER_DATAMART` (
`SEARCH_TITLE`,
`ADDRESS`,
`SEARCH_WORD`
);
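To see how the primary key and foreign key constraints behave when scraped rows are stored, the sketch below loads two of the tables into an in-memory SQLite database. SQLite is only a stand-in here so the example runs without a server; it is not MariaDB and ignores the VARCHAR lengths. The helper `insert_result` and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE Master (
    SEARCH_WORD VARCHAR(200) NOT NULL PRIMARY KEY,
    SEARCH_RESULT_COUNT INT NULL
);
CREATE TABLE SUB_MASTER_WAREHOUSE (
    SEARCH_TITLE VARCHAR(200) NOT NULL,
    ADDRESS VARCHAR(200) NOT NULL,
    SEARCH_WORD VARCHAR(200) NOT NULL REFERENCES Master (SEARCH_WORD),
    PAGE_NUMBER INT NULL,
    CONTENTS_NUMBER INT NULL,
    PRIMARY KEY (SEARCH_TITLE, ADDRESS, SEARCH_WORD)
);
""")


def insert_result(conn, title, address, word, page, number):
    """Insert one search result; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO SUB_MASTER_WAREHOUSE VALUES (?, ?, ?, ?, ?)",
                (title, address, word, page, number),
            )
        return True
    except sqlite3.IntegrityError:
        return False


with conn:
    conn.execute("INSERT INTO Master VALUES (?, ?)", ("shazam", 1))

row = ("Shazam!", "https://www.rottentomatoes.com/m/shazam", "shazam", 1, 1)
ok = insert_result(conn, *row)         # parent row exists, key is new
duplicate = insert_result(conn, *row)  # same composite primary key again
orphan = insert_result(
    conn, "Dumbo", "https://www.rottentomatoes.com/m/dumbo_2019", "dumbo", 1, 1
)  # no matching SEARCH_WORD in Master
print(ok, duplicate, orphan)  # True False False
```

The practical consequence for the crawler is ordering: the search word has to be written to `Master` before any result rows that refer to it.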
# Reference site: https://wayhome25.github.io/python/2017/04/25/cs-27-crawling/