How to crawl network novels with Python Selenium
Preparing high-quality data is important for ML (machine learning), and a common way to obtain the target data is to crawl it from the internet. This blog is an example of crawling Chinese network novels. For more about data cleaning, see my next blog.
Components: Selenium, Beautiful Soup
- Install selenium and bs4:
pip install selenium
pip install bs4
- Use selenium to run the driver and bs4 to get the website content:
from selenium import webdriver
from bs4 import BeautifulSoup
# Selenium driver
driver = webdriver.Chrome()
url = "your_url"  # replace with the page you want to crawl
driver.get(url)
# Adjust according to your need
driver.implicitly_wait(10)
# Get the website page content
html_content = driver.page_source
soup = BeautifulSoup(html_content, 'html.parser')
text = soup.get_text(separator="\n")
# Clean the data with regexes
...
# Store the cleaned data in local files or databases
...
# Close the browser when finished
driver.quit()
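The cleaning and storing steps depend on the target site. Here is a minimal sketch, assuming the chapter text only needs blank lines and stray whitespace removed, that one known ad phrase should be dropped (the pattern is only an assumption), and that a plain UTF-8 text file is enough as storage (the filename is hypothetical):
import re

# Drop empty lines and strip surrounding whitespace from each line
lines = [line.strip() for line in text.splitlines() if line.strip()]
cleaned = "\n".join(lines)

# Remove site boilerplate such as "please bookmark this site" lines (pattern is an assumption)
cleaned = re.sub(r"请收藏本站.*", "", cleaned)

# Store the cleaned chapter in a local UTF-8 text file (hypothetical filename)
with open("novel_chapter.txt", "w", encoding="utf-8") as f:
    f.write(cleaned)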
Note: this example just uses the Chrome driver; you can download it from getwebdriver.com.
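If you prefer to manage the driver binary yourself, here is a minimal sketch of pointing Selenium 4 at a downloaded chromedriver (the path is an assumption; use wherever you saved the binary). Note that recent Selenium versions (4.6+) can also fetch a matching driver automatically via Selenium Manager, so this step is often optional.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Hypothetical local path to the downloaded chromedriver binary
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)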
Now it is done!