Monday, June 22, 2015

Article Curator - Web Scraping with CasperJS and Node


I was on vacation recently and had some time to start up a side project, so I created a custom article curator for a website I'm working on called The Tamriel Underground.  I wanted to automatically scrape sites and pull in articles that I could link to and show on my site, so that the articles section could populate itself with content.

And it works fairly well for the initial build.  I used CasperJS for the web scraper and NodeJS to spin up a web service that saves the data to a Postgres database.

You can view the whole project on my GitHub.  It's just two JavaScript files, one for the scraper and one for the web server.  It's a fairly small project at the moment, but I plan to build a few more scrapers for other sites to pull in data.

The Code

Both files are only about 200 lines of code combined, so there's not a ton of code to walk through, but this was my first time using Casper and Phantom, so those 200 lines took a bit of work.

The web server runs constantly and stays up using node's forever module, and the web scraper runs as a cron job every 2 hours.
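
For reference, the setup is roughly the following (the paths here are placeholders, not the exact ones I use):

    # keep the article server running with forever
    forever start article_server.js

    # crontab entry: run the curator every 2 hours (casperjs must be on the PATH)
    0 */2 * * * cd /path/to/article-curator && casperjs eso_pnote_curator.js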

First let's take a look at:

eso_pnote_curator.js

The goal of this script is to access the Elder Scrolls Online website, log in through the age gate, and then scrape the patch notes page for the links to the patch notes. The script then iterates through those links to pull down the HTML used in each article.

At the top of the file are the declarations:


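They look roughly like this (the URLs and exact settings below are placeholders rather than the real values, which live in the repo):

    var age_gate_url   = 'http://www.elderscrollsonline.com/age-gate';    // placeholder URL
    var patch_note_url = 'http://www.elderscrollsonline.com/patch-notes'; // placeholder URL
    var server_url     = 'http://localhost:3000/save';                    // where the NodeJS server listens

    var casper = require('casper').create({
        verbose: true,
        pageSettings: {
            loadImages:  false,   // keep page loads light
            loadPlugins: false
        }
    });

    var patch_note_links = [];   // links scraped from the patch notes page
    var patch_note_info  = {};   // title and article html, keyed by link
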
Here, I specified the URLs that I wanted to access, initialized Casper, and created an array and an object to store the data, as well as set a few settings.  Pretty straightforward, nothing special here.

Next, let's skip a few lines down to line 87.


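It looks something like this (the selector and form fields are illustrative, not the exact ones from the repo):

    casper.start(age_gate_url);

    // wait for the age gate form to show up, then fill it out and submit it
    casper.waitForSelector('form', function () {
        this.fill('form', { month: '01', day: '01', year: '1980' }, true);
    });

    // open the patch notes page and pull the links out of it
    casper.thenOpen(patch_note_url, function () {
        patch_note_links = this.evaluate(getPatchNoteLinks);
    });

    // kick everything off; getNotes runs once the steps above are finished
    casper.run(getNotes);
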
Here is the starting point for Casper.  The start function opens the age_gate_url, which points to the page that has to be filled out before going any further.  The script then waits for the page to load by waiting for the form's selector to show up, and then fills out the form.

Then it opens the page where the patch note links are and evaluates our getPatchNoteLinks function (which we'll look at later) within that page.

Finally, the script runs Casper.  This is what actually makes all the steps we've declared so far happen.  Once all those steps are done and the script has all of the patch note links, it calls getNotes, the callback that runs after Casper is finished.

Now the script can start parsing those links it's accumulated.


getNotes iterates through the links, calling Casper.run at the end of each link's processing and recursively calling itself at the end of each run until all the links are processed, at which point it calls out to the save function.


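Here's a sketch of those functions, reconstructed to match the description (the selectors and object layout are illustrative):

    // Processes one link at a time, re-running Casper until the queue is empty,
    // then hands everything off to saveInfo.
    function getNotes() {
        if (patch_note_links.length === 0) {
            return saveInfo();
        }
        var link = patch_note_links.shift();
        curateLink(link);          // queue up the steps for this link
        casper.run(getNotes);      // run them, then recurse for the next link
    }

    // Runs inside the loaded page, just like client-side JavaScript.
    function getPatchNoteLinks() {
        var anchors = document.querySelectorAll('.patch-notes a');
        return Array.prototype.map.call(anchors, function (a) {
            return a.href;
        });
    }

    // Opens a link, grabs the title and the article html, and stores them.
    function curateLink(link) {
        casper.thenOpen(link, function () {
            var title = this.evaluate(function () {
                return document.title;
            });
            var article = this.getHTML('article');   // everything between the article tags
            patch_note_info[link] = { title: title, article: article };
        });
    }
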
Above you can see the three functions we've seen used so far.  The first one we used, in the evaluate statement, is the middle one in the listing: getPatchNoteLinks.  This function runs inside the loaded page, just like client-side JavaScript or something you'd run in your browser console, and it returns its value back to our PhantomJS/CasperJS 'scope'.

You can see the evaluate statement used again in the curateLink function below it.  curateLink takes a link, uses evaluate to grab the title from the page and return it, grabs the entire article between the article tags, and then adds all of that information to the patch_note_info object.

And finally, once all this is done, it calls out to the final function, saveInfo.

The reason there are two parts to this application is that CasperJS and PhantomJS are NOT Node applications, so there are no server-side Postgres drivers (that I know of) built for them.  However, we CAN post that data to a separate NodeJS server that will store it for us.

saveInfo does exactly that.  We post our stringified object with all of our data to a NodeJS web server and let it do the saving for us.
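
A sketch of the idea (the request options follow Casper's documented POST support; the exact payload shape is illustrative):

    // POST the collected data to the NodeJS server and let it handle persistence.
    function saveInfo() {
        casper.open(server_url, {
            method:  'post',
            data:    JSON.stringify({ articles: patch_note_info }),
            headers: { 'Content-Type': 'application/json' }
        });
        casper.run(function () {
            this.echo('Articles posted to the server.');
            this.exit();
        });
    }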

This also allowed me to keep the curator lightweight.  I have a single location that handles saving to the database, and the curators can simply handle the scraping and POST the results when they're done.

So now let's look at the simple web server.

article_server.js

Again, there are basic declarations and requires at the top:

For this app I used express, simply because I'm familiar with it and it's easy to get a simple REST server up and running.  However, there are much lighter-weight options out there for something this small.

The script requires:

  • express - a nice REST framework
  • pg - this is the Postgres connection module allowing access to the database
  • body-parser - allows easy parsing of JSON data
  • async - a nice framework for iterating through database calls with different datasets.
And then the script sets the connection string and starts the app.


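Roughly (the connection string and port below are placeholders):

    var express    = require('express');
    var pg         = require('pg');
    var bodyParser = require('body-parser');
    var async      = require('async');

    // placeholder connection string for the articles database
    var connection_string = 'postgres://user:password@localhost/tamriel_underground';

    var app = express();
    app.listen(3000, function () {
        console.log('Article server listening on port 3000');
    });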

Then comes the bulk of the web server.

First it includes the body parser so that it can easily parse the data that gets sent.

Then it has the one endpoint, POST /save.  Here it grabs the articles, pulls out the sites, and pushes that data into an array.  Then it connects to the database and checks whether those sites already exist.  If they do, it skips them; otherwise it adds the data it doesn't have yet to the database.
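
Something along these lines (checkExistingSites and insertSites are hypothetical names for the two helper functions shown next; the real code may be organized differently):

    // parse incoming JSON bodies
    app.use(bodyParser.json());

    app.post('/save', function (req, res) {
        var articles = req.body.articles;   // keyed by site/link
        var sites = [];

        // collect the site links so we can check for duplicates in one query
        for (var site in articles) {
            sites.push(site);
        }

        pg.connect(connection_string, function (err, client, done) {
            if (err) {
                return res.status(500).send('Could not connect to the database');
            }
            // find the sites we don't have yet, then insert just those
            checkExistingSites(client, sites, function (err, new_sites) {
                if (err) { done(); return res.status(500).send('Query failed'); }
                insertSites(client, articles, new_sites, function (err) {
                    done();
                    if (err) { return res.status(500).send('Insert failed'); }
                    res.send('Saved ' + new_sites.length + ' new articles');
                });
            });
        });
    });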


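Sketches of those two functions (the table and column names are placeholders):

    // Returns the subset of sites that are not already in the database.
    function checkExistingSites(client, sites, callback) {
        client.query('SELECT site FROM articles WHERE site = ANY($1)', [sites], function (err, result) {
            if (err) { return callback(err); }
            var existing = result.rows.map(function (row) { return row.site; });
            var new_sites = sites.filter(function (site) {
                return existing.indexOf(site) === -1;
            });
            callback(null, new_sites);
        });
    }

    // Inserts each new article, one at a time, using async to chain the queries.
    function insertSites(client, articles, new_sites, callback) {
        async.eachSeries(new_sites, function (site, cb) {
            var article = articles[site];
            client.query(
                'INSERT INTO articles (site, title, body) VALUES ($1, $2, $3)',
                [site, article.title, article.article],
                cb
            );
        }, callback);
    }
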
And that is exactly what these two functions do.  insertSites takes the sites the database doesn't already have stored and iterates through each one with the async library.  This is where async is convenient: it makes it easy to step through the asynchronous calls needed for the database insertions.

And that's all there is to the Curator!

That data then gets consumed by The Tamriel Underground, which is a Django application, and converted into a model.  I lined up the Python model schema with the way I'm entering the data, and it works quite well.

The articles get added to the page, and you can view them with the plus button.



Right now I'm using the HTML I get from Elder Scrolls Online as-is, but I'm going to create a cleaning function to strip out anything that might be unsafe.  Since I'm scraping the HTML and displaying it on my site, I want to make sure that no one can insert any nasty script tags or links that might cause a security concern.

Since I know the source that I'm scraping, I'm not terribly concerned at the moment, but as I add more sources to the curator, I'll definitely want a robust input-cleansing module.
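
Even a naive first pass would be something along these lines (a sketch only; in practice a dedicated sanitizer library is the safer choice):

    // Hypothetical cleaning step to run on the scraped html before saving it.
    function cleanHtml(html) {
        return html
            .replace(/<script[\s\S]*?<\/script>/gi, '')     // drop script blocks
            .replace(/\son\w+\s*=\s*(['"]).*?\1/gi, '')     // drop inline event handlers
            .replace(/javascript:/gi, '');                  // drop javascript: urls
    }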

So that's pretty much it.  Feel free to pull the GitHub repo down, play around with it, and let me know what you think.

Cheers,
Jason C.