How to crawl network novels with Python Selenium

It is important for machine learning (ML) to prepare high-quality data, and crawling the internet is a common way to retrieve the target data. This blog is an example of crawling Chinese network novels. For more about data cleaning, see my next blog.

Components: Selenium, Beautiful Soup

  1. Install selenium and bs4
    pip install selenium
    pip install bs4
    
  2. Use Selenium to drive the browser and bs4 to parse the page content
from selenium import webdriver
from bs4 import BeautifulSoup

# Selenium driver
driver = webdriver.Chrome()
url = "your_url"  # replace with the novel page you want to crawl
driver.get(url)

# Implicit wait (in seconds) for element lookups; adjust according to your need
driver.implicitly_wait(10)

# Get the website page content
html_content = driver.page_source
soup = BeautifulSoup(html_content, 'html.parser')
text = soup.get_text(separator="\n")
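
# Optional: many novel sites wrap the chapter body in a specific container,
# e.g. a <div id="content"> (the tag and id here are assumptions and vary by
# site). Extracting from that node usually gives cleaner text than calling
# get_text() on the whole page.
content_div = soup.find("div", id="content")
if content_div is not None:
    text = content_div.get_text(separator="\n")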

# Clean the data with regexes
...

# Store the cleaned data in local files or databases
...
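
Here is a minimal sketch of what the last two steps might look like. The regex patterns and the output filename (novel_chapter.txt) are illustrative assumptions; the patterns you actually need depend on the boilerplate of the site you crawl.

import re

# Collapse repeated blank lines and normalize spaces (including the
# full-width space U+3000 that is common in Chinese text)
cleaned = re.sub(r"\n{2,}", "\n", text)
cleaned = re.sub(r"[ \t\u3000]+", " ", cleaned)
cleaned = cleaned.strip()

# Write the cleaned text to a local UTF-8 file
with open("novel_chapter.txt", "w", encoding="utf-8") as f:
    f.write(cleaned)

# Close the browser when finished
driver.quit()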

Note: this just uses the Chrome driver as an example. You can download the Chrome driver from getwebdriver.com.
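
If you download the driver manually, you can point Selenium at the binary explicitly, as in the sketch below (the path is a hypothetical placeholder). Recent Selenium releases (4.6+) include Selenium Manager, which fetches a matching driver automatically, so this step is often unnecessary.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Hypothetical path to a manually downloaded ChromeDriver binary
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)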

Now it is done!