# Article Curator

I was on vacation recently and had some time to start up a side project, so I created a custom article curator for a website I'm working on called The Tamriel Underground. I wanted to automatically scrape sites and pull in articles that I could link to and show on my site, so that I could have automatic content populate the articles section.
And it works fairly well for the initial build. I used CasperJS for the web scraper and NodeJS to spin up a web service which would save the data to a Postgres database.
## The Code

Both files are only about 200 lines of code combined, so it's not a ton of code to walk through, but this was my first time using Casper and Phantom, so it took me a bit of work for those 200 lines.
The web server runs constantly and stays up using the `forever` module, and the web scraper is run as a cron job every two hours.
First, let's take a look at the web scraper script.
The goal of this script is to access the Elder Scrolls Online website, log in through the Age Gate, and then scrape the Patch Notes page for the links to the patch notes. Then we iterate through those links to pull down the HTML for each article.
So at the top of our file we have our declarations:
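A minimal sketch of what those declarations look like (the exact URLs and variable names here are stand-ins, not necessarily what's in the repo):

```javascript
// Create the Casper instance; skip images and plugins to keep scraping fast
var casper = require('casper').create({
    pageSettings: {
        loadImages: false,
        loadPlugins: false
    }
});

// The pages we'll be visiting (URLs are stand-ins)
var age_gate_url = 'http://www.elderscrollsonline.com/age-gate';
var patch_notes_url = 'http://www.elderscrollsonline.com/news/category/patch-notes';

// Accumulators for the scraped data
var patch_note_links = [];  // links collected from the patch notes page
var patch_note_info = {};   // title and html for each article, keyed by link
```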
Here, we specify the URLs we're going to be accessing, initialize Casper, create an array and an object to store our data, and set a few options. Pretty straightforward, nothing special here.
Next, we're going to skip over a few lines down to line 87.
Here is the starting point for Casper. We open up our age_gate_url, which points to the page we need to fill out before going any further, wait for the form selector to show up on the loaded page, and then fill out the form.
Next we open up the page where we want to pull our patch note links from and call out to our getPatchNoteLinks function which we'll look at later, and evaluate that within the page.
Finally we run Casper. This is what actually makes all the steps we've declared so far happen. And once all those steps are done, and we have all of our patch notes links, we make the call to getNotes, which is the callback after Casper is finished running.
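Stitched together, a sketch of that flow might look like this (the selectors and form fields are assumptions about the page, not verified):

```javascript
// Runs inside the page context, so it has to be self-contained
function getPatchNoteLinks() {
    var anchors = document.querySelectorAll('a.patch-note');  // selector assumed
    return Array.prototype.map.call(anchors, function (a) {
        return a.href;
    });
}

casper.start(age_gate_url);

// Wait for the age gate form to appear, then fill and submit it
casper.waitForSelector('form', function () {
    this.fill('form', { month: '01', day: '01', year: '1980' }, true);
});

// Open the patch notes page and collect the links from within it
casper.thenOpen(patch_notes_url, function () {
    patch_note_links = this.evaluate(getPatchNoteLinks);
});

// Nothing above executes until run() is called; getNotes is the callback
// fired once every queued step has finished
casper.run(getNotes);
```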
Now we start parsing those links we've accumulated.
getNotes iterates through each link, calling `casper.run` at the end of the link processing and recursively calling itself at the end of each run, until all links are processed, at which point we call out to our save function.
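A rough sketch of that recursive pattern:

```javascript
function getNotes() {
    // All links processed? Hand everything off to be saved.
    if (patch_note_links.length === 0) {
        saveInfo();
        return;
    }

    // Queue up the steps for the next link, then run them; getNotes is
    // passed as the run callback, giving each link its own run cycle
    curateLink(patch_note_links.pop());
    casper.run(getNotes);
}
```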
You can see the evaluate statement used again below in the curateLink function. Here we take a link, use evaluate to grab the title from the page and return it, then grab the entire article from between the article tags, and add all of that information to our patch_note_info object.
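Sketched out, curateLink might look something like this (the title selector is an assumption):

```javascript
function curateLink(link) {
    casper.thenOpen(link, function () {
        // evaluate() runs this function inside the page and returns the title
        var title = this.evaluate(function () {
            return document.querySelector('h1').textContent;
        });

        // Grab everything between the <article> tags
        var article = this.getHTML('article');

        patch_note_info[link] = { title: title, html: article };
    });
}
```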
And finally, once all this is done, we call out to our final function, saveInfo.
The reason this application has two parts is that CasperJS and PhantomJS are NOT Node applications, so there are no server-side Postgres drivers (that I know of) built for them. However, we CAN post that data to a separate NodeJS server that will store it for us.
saveInfo does exactly that. We post our stringified object with all of our data to our NodeJS server and let it do the saving for us.
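Roughly like so (the server's host, port, and payload shape are my assumptions):

```javascript
function saveInfo() {
    // POST the stringified data to the Node server for storage
    casper.thenOpen('http://localhost:3000/save', {
        method: 'post',
        headers: { 'Content-Type': 'application/json' },
        data: JSON.stringify({ articles: patch_note_info })
    });

    casper.run(function () {
        this.exit();
    });
}
```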
This also allows us to keep our curators lightweight: we can have a single location that handles saving to the database, while the curators simply handle the scraping and POST the results when done.
So let's look now at our simple web server.
Again, we have our basic declarations and requires at the top:
- express - a nice REST framework
- pg - this is our Postgres connection module allowing us access to our database
- body-parser - this allows us to easily parse JSON data posted to our server
- async - a handy library for iterating through asynchronous operations, like our database calls, over different datasets.
And then we set our connection string and start our app.
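A minimal sketch of that setup (the connection details are placeholders):

```javascript
var express = require('express');
var pg = require('pg');
var bodyParser = require('body-parser');
var async = require('async');

// Placeholder credentials; swap in your own
var connectionString = 'postgres://user:password@localhost:5432/curator';

var app = express();
app.listen(3000);
```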
Then we have the bulk of our web server.
We include the body-parser JSON middleware so we can easily parse the data sent to us.
Then we have our one endpoint, POST /save. Here we grab our articles, pull out the sites, and push the data to an array. Then we connect to our database and check whether those sites already exist. If they do, we skip them; otherwise we add the data we don't have yet to the database.
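A hedged sketch of that endpoint; the table and column names are assumptions:

```javascript
app.use(bodyParser.json());

app.post('/save', function (req, res) {
    var articles = req.body.articles;
    var sites = Object.keys(articles);  // each article is keyed by its link

    pg.connect(connectionString, function (err, client, done) {
        if (err) { return res.status(500).send(err.message); }

        // Check which of these sites are already stored
        client.query('SELECT site FROM articles WHERE site = ANY($1)', [sites],
            function (err, result) {
                if (err) { done(); return res.status(500).send(err.message); }

                var existing = result.rows.map(function (row) { return row.site; });
                var missing = sites.filter(function (site) {
                    return existing.indexOf(site) === -1;
                });

                // Only insert the data we don't have yet
                insertSites(client, done, missing, articles, res);
            });
    });
});
```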
And that is exactly what these two functions do. insertSites takes the sites we don't already have stored and iterates through each one with the async library. This is where the async library is convenient: it lets us easily iterate through the asynchronous calls we need for the database insertion.
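Sketched out, with the same assumed schema as above:

```javascript
function insertSites(client, done, sites, articles, res) {
    // async.each runs the insert for every site and waits for all of the
    // callbacks to finish before moving on
    async.each(sites, function (site, callback) {
        var article = articles[site];
        client.query(
            'INSERT INTO articles (site, title, html) VALUES ($1, $2, $3)',
            [site, article.title, article.html],
            callback
        );
    }, function (err) {
        done();  // release the client back to the pool
        if (err) { return res.status(500).send(err.message); }
        res.send('saved');
    });
}
```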
And that's all there is to the Curator!
That data then gets consumed by The Tamriel Underground, which is a Django application, and converted to a model. I lined up the model schema in Python with the way I'm entering the data, and it works quite well.
The articles get added to the page, and you can view them with the plus button.
Right now I'm using the HTML I get from Elder Scrolls Online as-is, but I'm going to create a cleaning function to strip out anything that might be invalid. Since I'm scraping the HTML and displaying it on my site, I want to make sure no one can insert any nasty script tags or links that might cause a security concern.
Since I know the source that I'm scraping I'm not terribly concerned at the moment, but as I add more sources to the curator, I'll definitely want a robust input cleansing module.
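To give a flavor of what I mean, a naive first pass at such a cleaning function might look like this, though a real version should lean on a vetted sanitizer library rather than hand-rolled regexes:

```javascript
// Illustrative only: regex-based cleaning is easy to bypass, so don't rely
// on this for real sanitization
function cleanHtml(html) {
    return html
        // drop <script> blocks entirely
        .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, '')
        // strip inline event handlers like onclick="..."
        .replace(/\son\w+\s*=\s*(['"]).*?\1/gi, '')
        // neutralize javascript: URLs in links
        .replace(/href\s*=\s*(['"])\s*javascript:.*?\1/gi, 'href="#"');
}
```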
So that's pretty much it. Feel free to pull the GitHub repo down and play around with it, and let me know what you think.