WordPress Blog Scraping with BeautifulSoup
Blog scraping is one of the methods often used to obtain data that can make up a corpus for NLP. According to [1], “40% of the web is built on WordPress,” so developing the skills required to scrape these blogs can be extremely useful.
WordPress provides a REST API that allows us to interact with its websites, sending and receiving data as JSON. We will use this feature along with BeautifulSoup to scrape all the available posts in a blog.
This article will exemplify how to scrape these blogs using BeautifulSoup, a Python library that helps us parse data out of HTML and XML files. We will also use requests to access the website.
To be sure that the blog we want to extract data from can be scraped using the WordPress API, we first must confirm that the website is built on WordPress, and then check whether appending /wp-json/wp/v2/posts to its URL returns JSON.
We should try adding this to the main website:
http://website/wp-json/wp/v2/posts
Or we could try:
http://website/blog/wp-json/wp/v2/posts
Finding the right place to append /wp-json/wp/v2/posts might take a little time. Be sure to explore all of the plausible locations until we obtain the JSON we will parse.
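If we want to automate this check, a minimal sketch like the following can probe the usual locations (the base URL and candidate paths here are illustrative assumptions):

import requests

CANDIDATE_PATHS = ['/wp-json/wp/v2/posts', '/blog/wp-json/wp/v2/posts']

def find_posts_endpoint(base_url):
    """Return the first candidate URL that answers with JSON, or None."""
    for path in CANDIDATE_PATHS:
        url = base_url.rstrip('/') + path
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            response.json()  # raises ValueError if the body is not JSON
            return url
        except (requests.RequestException, ValueError):
            continue
    return None

print(find_posts_endpoint('http://website'))  # hypothetical blog URL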
Now that we have a website that can be scraped, we can begin writing our Jupyter Notebook to obtain the posts.
import requests
from bs4 import BeautifulSoup
Afterward, we will include the information regarding the blog we want to scrape.
orig_url = 'http://website_to_use/wp-json/wp/v2/posts/'
blog_name = 'blog_name'
As shown in other posts, defining a User-Agent in our headers helps us avoid 403 Forbidden responses.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
blog_json = requests.get(orig_url, headers=headers).json()
We will now have a list of dictionaries of length 10 (the WordPress default), in which each element corresponds to one of the posts.
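If ten posts per request is too few, the WordPress REST API also accepts a per_page parameter (up to 100). A quick sketch:

blog_json = requests.get(orig_url, headers=headers, params={'per_page': 100}).json()
print(len(blog_json))  # up to 100 posts instead of the default 10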
To ensure we are getting all the posts, we will order our results in ascending order, adding the parameter ?order=asc to the orig_url:
orig_url = 'http://website_to_use/wp-json/wp/v2/posts?order=asc'
For this example, we will only work with the first entry. We will now analyze what keys will help us extract the content of these posts.
blog_json = requests.get(orig_url, headers=headers).json()
entry = blog_json[0]
list(entry.keys())
To understand what each key returns, you can consult the WordPress REST API documentation. The keys we will use are date_gmt, title, and content.
entry_date = entry['date_gmt']
entry_title = entry['title']['rendered']
entry_content = entry['content']['rendered']
We will proceed to use BeautifulSoup to extract the text from entry_content.
entry_text = BeautifulSoup(entry_content, 'html.parser').get_text()
As we can see, BeautifulSoup does a great job of getting the text. Unfortunately, it sometimes leaves behind strings like \xa0 or \n, but this can easily be corrected.
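A minimal sketch of one fix, collapsing all whitespace into single spaces (str.split with no arguments treats both \xa0 and \n as whitespace):

entry_text = ' '.join(entry_text.split())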
And that’s it! We have extracted our text from a WordPress blog. A next step could be using the after parameter in our query to easily get all of the posts in the blog, as sketched below.
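A minimal sketch of that idea, assuming ascending date order (fetch_all_posts is a hypothetical helper; because after is exclusive, posts sharing the exact same publication timestamp could be skipped):

def fetch_all_posts(orig_url, headers):
    """Collect every post by repeatedly requesting posts published
    after the last date we have already seen."""
    posts = []
    params = {'order': 'asc', 'orderby': 'date'}
    while True:
        batch = requests.get(orig_url, headers=headers, params=params).json()
        if not batch:  # an empty list means there is nothing left to fetch
            break
        posts.extend(batch)
        # 'after' limits the next response to posts published after this date
        params['after'] = batch[-1]['date']
    return posts

all_posts = fetch_all_posts('http://website_to_use/wp-json/wp/v2/posts', headers)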
And remember, always check if the blog is scrapable!
Special thanks to Dr. Regis Durazo for help editing.
[1]: WordPress. (n.d.). WordPress.com. https://wordpress.com