This blog post will haunt your computer forever. You have been warned! :D
In this article, we are going to present different techniques to scrape a Ghost blog (especially one using the default Casper theme). We will use CasperJS, a scripting utility for PhantomJS (a headless WebKit browser) and SlimerJS (its Gecko-based counterpart).
A regular CasperJS script is composed of steps and has the following structure:
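As a minimal sketch (the URL is a placeholder), the step-based structure looks like this:

```javascript
// Skeleton of a CasperJS script: steps are queued, then executed by run()
var casper = require('casper').create();

casper.start('http://example.com/', function() {
    // First step: runs once the page has loaded
    this.echo(this.getTitle());
});

casper.then(function() {
    // Subsequent steps execute in order, one navigation state at a time
    this.echo('Another step');
});

casper.run(); // nothing happens until run() is called
```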
Casper.prototype defines many useful methods for interacting with web pages, and the most powerful one is probably evaluate(), which gives direct access to the DOM.
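For instance, here is a small sketch that collects every link of the current page. The function passed to evaluate() runs in the page context, and only serializable values can cross back to the Casper environment:

```javascript
var casper = require('casper').create();

casper.start('http://example.com/', function() {
    // evaluate() executes this function inside the page's DOM
    var links = this.evaluate(function() {
        return Array.prototype.map.call(
            document.querySelectorAll('a'),
            function(a) { return a.getAttribute('href'); }
        );
    });
    this.echo(links.join('\n'));
});

casper.run();
```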
Printing a list of the latest articles to console
OK, so… Let’s take my personal blog (in French) as an example. The following script (ghost.js) gets the list of the latest articles, the ones displayed on the homepage:
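Here is a sketch of such a script. The blog URL is a placeholder, and the `.post-title a` selector assumes the default Casper theme markup; adjust it to your theme if needed:

```javascript
// ghost.js — prints the titles of the latest articles from the homepage
var casper = require('casper').create();

casper.start('http://blog.example.com/', function() {
    // Grab every post title link on the homepage (Casper theme selector)
    var titles = this.evaluate(function() {
        return Array.prototype.map.call(
            document.querySelectorAll('.post-title a'),
            function(a) { return a.textContent.trim(); }
        );
    });
    titles.forEach(function(title) {
        casper.echo(title);
    });
});

casper.run();
```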
To run the script:

```shell
casperjs --engine=slimerjs ghost.js
```
The script prints the title of each article found on the homepage, one per line. So far, so good! But what if we want to get a full list of all articles?
Printing a list of all published articles to console
An elegant solution to achieve this goal is to use recursion. Our scraper starts on the homepage, tests if there are older posts (using pagination as a reference), and goes from page to page thanks to an IIFE (Immediately-Invoked Function Expression) which calls itself.
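A sketch of this recursive approach follows. The selectors (`.post-title a` for article links, `a.older-posts` for the pagination link) assume the default Casper theme, and the blog URL is a placeholder:

```javascript
// Prints the titles of every published article by walking the pagination
var casper = require('casper').create();

casper.start('http://blog.example.com/', function() {
    // IIFE: prints the current page's titles, then schedules itself
    // again after clicking the "older posts" pagination link
    (function scrapePage() {
        var titles = casper.evaluate(function() {
            return Array.prototype.map.call(
                document.querySelectorAll('.post-title a'),
                function(a) { return a.textContent.trim(); }
            );
        });
        titles.forEach(function(title) { casper.echo(title); });

        // Recurse only while an older-posts link exists on the page
        if (casper.exists('a.older-posts')) {
            casper.thenClick('a.older-posts').then(scrapePage);
        }
    })();
});

casper.run();
```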
Cool! Getting lists of headlines is fun, but is it really useful? RSS feed readers already do this for us without any effort. Wouldn’t it be more interesting to extract every article and automatically create files to store local copies? Let’s do this right now!
Extracting all articles to save local copies in filesystem
Extracting all articles is not so easy without a clear methodology. Our script should:
- Start on the homepage
- Visit the first article
- Extract this article
- Go back to the homepage
- Visit the second article, and so on, until every article on the page has been extracted
- Go to the next page
- Visit the first article of this page, and repeat the process
- End when the last article of the last page has been extracted
It would be easy to define an array of predefined URLs, but this is not what we are going to do: that solution is not viable, because each newly published article would be missed by our script. Instead, we have to simulate the behavior of a user who clicks each article in turn to download every blog post.
Here we will need to interact with the filesystem, so the fs module is required. For this script, we keep the recursive IIFE:
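The following sketch puts the pieces together. The selectors (`.post-title`, `.post-content`, `a.older-posts`) assume the default Casper theme, the blog URL is a placeholder, and naming files after the URL slug is an illustrative choice. Rather than clicking an article and going back, it collects the links of each list page and opens them in turn, which follows the same user path with fewer navigations:

```javascript
// Saves the raw text of every article under a local articles/ directory
var casper = require('casper').create();
var fs = require('fs'); // filesystem access (PhantomJS/SlimerJS fs module)

casper.start('http://blog.example.com/');

casper.then(function scrapeListPage() {
    // Collect the article URLs of the current list page
    var urls = this.evaluate(function() {
        return Array.prototype.map.call(
            document.querySelectorAll('.post-title a'),
            function(a) { return a.href; }
        );
    });

    // Remember the next page's URL before navigating away to the articles
    var nextUrl = this.evaluate(function() {
        var a = document.querySelector('a.older-posts');
        return a ? a.href : null;
    });

    // Queue one step per article: open it, extract it, write it to disk
    urls.forEach(function(url) {
        casper.thenOpen(url, function() {
            var title = this.fetchText('.post-title');
            var body = this.fetchText('.post-content');
            var slug = url.replace(/\/$/, '').split('/').pop();
            fs.write('articles/' + slug + '.txt', title + '\n\n' + body, 'w');
            this.echo('Saved ' + slug);
        });
    });

    // Recurse on the next list page, if any
    if (nextUrl) {
        casper.thenOpen(nextUrl, scrapeListPage);
    }
});

casper.run();
```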
Done! This script creates a local articles/ directory containing one file per article, with its raw textual content.
CasperJS is a powerful automation library on top of PhantomJS and SlimerJS. Moreover, thanks to its clean HTML markup, a Ghost blog is a pleasure to scrape!
If you want to learn more about these amazing technologies, try scraping articles filtered by author, by tag, or by date… It is not very difficult starting from the code provided here, and it makes a good exercise. You may also be interested in the Ghost API.
Spectres have never been so kind!