How to crawl network novels with Python Selenium

It is important for machine learning (ML) to prepare high-quality data, and crawling the internet is a common way to retrieve the target data. This blog is an example of crawling Chinese network novels. For more about data cleaning, see my next blog.

Components: Selenium, Beautiful Soup

  1. Install selenium and bs4
    pip install selenium
    pip install bs4
    
  2. Use Selenium to drive the browser and bs4 to parse the page content
from selenium import webdriver
from bs4 import BeautifulSoup

# Selenium driver
driver = webdriver.Chrome()
url = "your_url"  # replace with the novel page you want to crawl
driver.get(url)

# Implicit wait (in seconds) for element lookups; adjust according to your need
driver.implicitly_wait(10)

# Get the website page content
html_content = driver.page_source
soup = BeautifulSoup(html_content, 'html.parser')
text = soup.get_text(separator="\n")
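
# Optional: many novel sites wrap the chapter body in a specific container,
# e.g. a <div id="content"> (the tag and id here are assumptions and vary by
# site). Extracting from that node usually gives cleaner text than calling
# get_text() on the whole page.
content_div = soup.find("div", id="content")
if content_div is not None:
    text = content_div.get_text(separator="\n")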

# Clean the data with regexes
...

# Store the cleaned data in local files or databases
...
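
Here is a minimal sketch of what the last two steps might look like. The regex patterns and the output filename (novel_chapter.txt) are illustrative assumptions; the patterns you actually need depend on the boilerplate of the site you crawl.

import re

# Collapse repeated blank lines and normalize spaces (including the
# full-width space U+3000 that is common in Chinese text)
cleaned = re.sub(r"\n{2,}", "\n", text)
cleaned = re.sub(r"[ \t\u3000]+", " ", cleaned)
cleaned = cleaned.strip()

# Write the cleaned text to a local UTF-8 file
with open("novel_chapter.txt", "w", encoding="utf-8") as f:
    f.write(cleaned)

# Close the browser when finished
driver.quit()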

Note: this just uses the Chrome driver as an example. You can download the Chrome driver from getwebdriver.com.
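
If you download the driver manually, you can point Selenium at the binary explicitly, as in the sketch below (the path is a hypothetical placeholder). Recent Selenium releases (4.6+) include Selenium Manager, which fetches a matching driver automatically, so this step is often unnecessary.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Hypothetical path to a manually downloaded ChromeDriver binary
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)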

Now it is done!