Monday, June 22, 2015

Article Curator - Web Scraping with CasperJS and Node

Article Curator

I was on vacation recently and had some time to start up a side project, so I created a custom article curator for a website I'm working on called The Tamriel Underground.  I wanted to automatically scrape sites and pull in articles that I could link to and show on my site, so that the articles section would populate itself with content automatically.

And it works fairly well for the initial build.  I used CasperJS for the web scraper and NodeJS to spin up a web service which would save the data to a Postgres database.

You can view the whole project on my GitHub.  It's just two JavaScript files: one for the scraper and one for the web server.  It's a fairly small project at the moment, but I plan to build a few more scrapers for other sites to pull in data.

The Code

Both files are only about 200 lines of code combined, so it's not a ton of code to walk through, but this was my first time using Casper and Phantom, so those 200 lines took me a bit of work.

The web server runs constantly and stays up using forever, and the web scraper runs as a cron job every 2 hours.
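If you want to reproduce that setup, it amounts to something like this (the script path is a placeholder):

    # keep the web server up
    forever start article_server.js

    # crontab entry to run the scraper every 2 hours
    0 */2 * * * casperjs /path/to/eso_pnote_curator.js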

First let's take a look at:

eso_pnote_curator.js

The goal of this script is to access the Elder Scrolls Online website, log in through the age gate, and then scrape the Patch Notes page for the links to the patch notes.  Then we iterate through those links to pull down the HTML used in each article.

So at the top of our file we have our declarations:
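They look something like this (a rough sketch; the URLs and option names here are placeholders rather than the exact code from the repo):

    var casper = require('casper').create({
      pageSettings: {
        loadImages: false,   // skip images, we only want the markup
        loadPlugins: false
      }
    });

    // the pages we need: the age gate and the patch notes listing (placeholder URLs)
    var age_gate_url = 'http://www.elderscrollsonline.com/age-gate';
    var patch_note_url = 'http://www.elderscrollsonline.com/news/category/patch-notes';

    var patch_note_links = [];   // hrefs scraped from the listing page
    var patch_note_info = {};    // title and html for each link, filled in later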


Here, we specify the URLs that we're going to be accessing, we initialize Casper, and we create an array and an object to store our data, as well as set a few settings.  Pretty straightforward, nothing special here.

Next, we're going to skip over a few lines down to line 87.
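That section amounts to something like the following sketch (the selectors and form field names are stand-ins for the real ones):

    casper.start(age_gate_url);

    // wait for the age gate form to render, then fill it out and submit it
    casper.waitForSelector('form', function () {
      this.fill('form', { month: '01', day: '01', year: '1980' }, true);
    });

    // open the patch notes listing and collect the article links in-page
    casper.thenOpen(patch_note_url, function () {
      patch_note_links = this.evaluate(getPatchNoteLinks);
    });

    // nothing happens until run() is called; getNotes fires when all steps finish
    casper.run(getNotes);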


Here is the starting point for Casper.  We open up our age_gate_url, which points us to the page we need to fill out before going any further, then we wait for the page to load by waiting for the selector to show up, and then we fill out the form.

Next we open up the page we want to pull our patch note links from, call out to our getPatchNoteLinks function (which we'll look at later), and evaluate it within the page.

Finally we run Casper.  This is what actually makes all the steps we've declared so far happen.  And once all those steps are done, and we have all of our patch note links, we make the call to getNotes, which is the callback for when Casper is finished running.

Now we start parsing those links we've accumulated.
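The idea looks roughly like this (a sketch of the shape, not the repo line for line):

    function getNotes() {
      if (patch_note_links.length === 0) {
        saveInfo();   // nothing left to process, ship everything to the node server
        return;
      }
      var link = patch_note_links.shift();   // take the next link off the front
      curateLink(link);                      // queues up the Casper steps for this link
      casper.run(getNotes);                  // run them, then recurse for the next link
    }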


getNotes iterates through the links, calling casper.run at the end of each link's processing and recursively calling itself at the end of each run, until all links are processed, at which point we call out to our save function.
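The page-side scraping functions are sketched below, with hypothetical selectors (the real ones are in the repo):

    // runs inside the loaded page, so it can use the DOM directly
    function getPatchNoteLinks() {
      var anchors = document.querySelectorAll('.patch-notes a');   // placeholder selector
      return Array.prototype.map.call(anchors, function (a) {
        return a.href;
      });
    }

    function curateLink(link) {
      casper.thenOpen(link, function () {
        var title = this.evaluate(function () {
          return document.querySelector('h1').textContent;   // placeholder selector
        });
        var article = this.getHTML('article');   // everything between the article tags
        patch_note_info[link] = { site: link, title: title, html: article };
      });
    }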


Above you can see the two scraping functions.  The first one, which we ran in the evaluate statement, is getPatchNoteLinks.  This function is run inside the loaded page, just like client-side JavaScript or something you'd run in your console, and it returns its value back to our PhantomJS/CasperJS 'scope'.

You see the evaluate statement used again in the curateLink function as well.  Here we take a link, use evaluate to grab the title from the page and return it, then grab the entire article between the article tags, and add all of that information to our patch_note_info object.

And finally, once all this is done, we call out to our final function, saveInfo.

The reason that we have two parts to this application is that CasperJS and PhantomJS are NOT node applications, and thus there are no server side database drivers (that I know of) built for them for Postgres.  However we CAN post that data to a separate NodeJS server that will store it for us.

saveInfo does exactly that.  We post our stringified object with all of our data to our NodeJS server and let it do the saving for us.
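Something like this (the port and endpoint are placeholders for whatever the node server exposes):

    function saveInfo() {
      // POST the stringified article data to the node server
      casper.thenOpen('http://localhost:3000/save', {
        method: 'post',
        headers: { 'Content-Type': 'application/json' },
        data: JSON.stringify({ articles: patch_note_info })
      });
      casper.run(function () {
        this.exit();   // all done, shut Phantom down
      });
    }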

This also allows us to keep our curators lightweight.  We can have a single location that handles saving to the database, and the curators can simply handle the scraping and POST the results when done.

So let's look now, at our simple web server.

article_server.js

Again, we have our basic declarations and requires at the top:

For this app I used express, simply because I'm familiar with it and it's easy to get a simple REST server up and running.  There are much lighter weight options out there for something small like this, though.

We require:

  • express - a nice REST framework
  • pg - this is our Postgres connection module allowing us access to our database
  • body-parser - this allows us to easily parse JSON data posted to our server
  • async - a nice library for iterating through asynchronous database calls with different datasets
And then we set our connection string and start our app.
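In sketch form (the connection string and port are obviously placeholders):

    var express = require('express');
    var pg = require('pg');
    var bodyParser = require('body-parser');
    var async = require('async');

    // placeholder credentials; the real ones would come from config
    var connectionString = 'postgres://user:password@localhost:5432/articles';

    var app = express();
    app.listen(3000);   // hypothetical port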



Then we have the bulk of our web server.

We include the body-parser JSON middleware so we can easily parse the data sent to us.

Then we have our one endpoint, POST /save.  Here we grab our articles, pull out our sites, and push the data to an array.  Then we connect to our database and check if those sites already exist.  If a site exists, we skip it; otherwise we add the data we don't have yet to the database.
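Sketched out, with a hypothetical table name and a made-up checkSites helper:

    app.use(bodyParser.json());   // parse JSON bodies on incoming POSTs

    app.post('/save', function (req, res) {
      var articles = req.body.articles;
      var sites = [];
      for (var key in articles) {
        sites.push(articles[key].site);   // collect the site/link for each article
      }

      pg.connect(connectionString, function (err, client, done) {
        if (err) { return res.status(500).send('could not connect to the database'); }
        checkSites(client, done, articles, sites, res);
      });
    });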


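The two helpers might look like this (again a sketch, with the same hypothetical names):

    // find which sites are already stored, then hand only the new ones to insertSites
    function checkSites(client, done, articles, sites, res) {
      client.query('SELECT site FROM articles WHERE site = ANY($1)', [sites], function (err, result) {
        if (err) { done(); return res.status(500).send('site lookup failed'); }
        var existing = result.rows.map(function (row) { return row.site; });
        var fresh = sites.filter(function (s) { return existing.indexOf(s) === -1; });
        insertSites(client, done, articles, fresh, res);
      });
    }

    // insert each new article one at a time; async tells us when they all finish
    function insertSites(client, done, articles, sites, res) {
      async.eachSeries(sites, function (site, next) {
        var a = articles[site];
        client.query(
          'INSERT INTO articles (site, title, html) VALUES ($1, $2, $3)',
          [a.site, a.title, a.html],
          next
        );
      }, function (err) {
        done();   // release the client back to the pool
        if (err) { return res.status(500).send('insert failed'); }
        res.send('saved');
      });
    }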
And that is exactly what these two functions do.  insertSites takes the sites we don't already have stored and iterates through each one with the async library.  This is where the async library is convenient: it allows us to easily iterate through the asynchronous methods we need for the database insertion.

And that's all there is to the Curator!

That data then gets consumed by the Tamriel Underground, which is a Django application, and converted to a model.  I lined up the Python model schema with the way I'm entering the data, and it works quite well.

The articles get added to the page, and you can view them with the plus button.



Right now I'm using the HTML I get from Elder Scrolls Online as-is, but I'm going to create a cleaning function to strip out anything that might be invalid.  Since I'm scraping the HTML and displaying it on my site, I want to make sure that no one can insert any nasty script tags or links that might cause a security concern.

Since I know the source that I'm scraping I'm not terribly concerned at the moment, but as I add more sources to the curator, I'll definitely want a robust input cleansing module.

So that's pretty much it.  Feel free to pull the GitHub repo down, play around with it, and let me know what you think.

Cheers,
Jason C.

Friday, May 29, 2015

Green Lit

Well I can't believe it happened, but SBX: Invasion finally got Greenlit and will be on Steam on June 1st!



It's been a long time coming and it's been an awesome experience!  At first, when I got the e-mail that SBX: Invasion had been Greenlit, I thought it was a phishing scam, so I skeptically opened a separate tab and went to the Steam page.  And then it hit me: a game I had made was going to be on Steam, a game that had sat in the Greenlight process for almost 15 months.

I got excited.  And then I got to work.  I had to fix a few bugs I knew of, create new builds, get the game compatible with the Steam upload system, create new art assets for the store, create a new promo video, and learn how Steamworks Builder worked.

I spent a couple weekends preparing and getting everything together and set the launch date to June 1st.  And now it's almost here.

I don't know what this will mean for SBX: Invasion, but a small part of me feels like I've succeeded, just a tiny bit.  That feels good.  

So if you haven't yet, check out the Steam Page, or check out my YouTube promo video.

Or drop me a line on Twitter.

Cheers and hope to have some awesome things in store to share soon!

Jason C / WakeskaterX


Tuesday, April 28, 2015

Tamriel Underground

I've been working on a fan site lately; a home for the Thieves and Scoundrels of Elder Scrolls Online to gather and share news.

I present to you, the Tamriel Underground:


Tamriel Underground is a members only site for the Thieves of Tamriel to share tips, tricks, maps and guides.  It will be a place to post Thieving challenges and share your exploits.

Currently only the Maps tab is active.  I've taken maps from the Rogue's Folio (by reddit user /u/anemonean -- props), uploaded them, and organized them so they can be filtered and searched.


You can view the maps in a list and refine them with the filters at the top, or use the search bar to narrow the list down to what you are looking for.  Once you click on the map image you want, it pops up in a lightbox, letting you see the full map.


The website is in its early phases, and there isn't a lot of content at the moment, but the plan is to create a massive collection of all of the best Thieving tips and tricks, and to keep away all those pesky Guardians that will eventually be trying to steal our secrets.

Come join The Underground, where the night is your friend and stealth is your weapon.




Thursday, March 5, 2015

NodeJS - A Little Experiment in Load Testing and Clustering

Load Testing NodeJS on Multiple Cores

Check out the GitHub project HERE

Something I wanted to investigate was using NodeJS on a multi-core system for CPU intensive applications under high load, so I created a project to do just that.

There are a few ways to use multiple cores in NodeJS, two of which are Cluster (part of the NodeJS API) and WebWorker Threads.

I will preface this by saying I am no NodeJS expert.  This project was simply for learning on my part, and there may be a better way to go about this with an Nginx setup, but I wanted to share my findings, and maybe you'll find it cool too.

The Planning Stages

I started the project out with a simple NodeJS and Express server which, when called with a number in the query string, would calculate and return the Fibonacci number for that value, to simulate high-CPU calculations.
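The vanilla version boils down to something like this (a sketch; the route and port line up with the loadtest command below, the rest is approximate):

    var express = require('express');
    var app = express();

    // deliberately naive, recursive fibonacci to burn CPU on the event loop
    function fib(n) {
      if (n < 2) return n;
      return fib(n - 1) + fib(n - 2);
    }

    app.get('/fib', function (req, res) {
      var num = parseInt(req.query.num, 10) || 0;
      res.send(String(fib(num)));
    });

    app.listen(3030);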

My first thought was to use webworker-threads to spin up a thread to do the work: just pass in the data and let it run on a background thread.  But this caused issues under high load, as too many threads were being spun up (I had no cap at the time), and at the start they were going out of scope and causing segmentation faults.

I ended up fixing the seg fault issue, but even so, they were burning through all of my VM's 2 GB of memory and crashing.

I attempted to create a worker pool to control the number of threads running at any one time, but it quickly got quite complicated.  The idea was to use a queue, simply drop information off, and process it when ready, but with web requests waiting on data, this too got very complicated.

Then I stumbled across the Cluster section of the NodeJS API and found my solution.  With the Cluster API you can spin up multiple instances of your node server on the same port, so I decided to test whether this truly would improve my performance.
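The pattern is short enough to show in full; this is the standard shape from the Cluster docs rather than my exact code:

    var cluster = require('cluster');
    var numCPUs = require('os').cpus().length;

    if (cluster.isMaster) {
      // fork one worker per core; they all share the same listening port
      for (var i = 0; i < numCPUs; i++) {
        cluster.fork();
      }
    } else {
      // each worker runs its own copy of the fibonacci server
      require('./server');   // hypothetical module that calls app.listen(3030)
    }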

Load Testing

I installed loadtest in order to deliver high-concurrency testing to my application and began to test it.  On one branch of my Git project I had the vanilla, single-threaded NodeJS server which would generate a Fibonacci sequence, and on the other branch I had the version which spun up 4 servers on a single port, using the multiple cores.

This was a MUCH simpler solution than building a multi-threaded system within a single event loop, and it was extremely easy to set up.

So the load testing began.  I called out to my application like so:

loadtest -t 20 -c 32 http://localhost:3030/fib?num=25

I pinged my application using various concurrency values over 20-second testing periods, from 1 concurrent connection up to 1000.

While I don't believe all four cores were actually being used (it was on a VM, and my computer didn't crash), I did see quite an improvement using the multiple-core approach (as one would expect).

The Results

The results from the load test are as follows.  loadtest outputs the response times within which 50%, 90%, 95%, and 99% of requests completed, the maximum response time, requests per second, mean response time (latency), and total completed requests.

Computer specs: the VM was an Ubuntu 14 VM with 2 GB RAM and 4 processors.

Each test requested the Fibonacci value for 25 and ran for 20 seconds.

NodeJS Single Event Loop

Concurrency | Completed Requests | 50% (ms) | 90% (ms) | 95% (ms) | 99% (ms) | Max (ms) | Req/Sec | Mean Lat (ms)
64          | 8891               | 135      | 169      | 180      | 245      | 286      | 429     | 150
200         | 8110               | 336      | 1276     | 1358     | 3300     | 7364     | 405     | 480
1000        | 8178               | 1235     | 3726     | 7626     | 15726    | 16256    | 409     | 2510

NodeJS Multi-Core with Cluster

Concurrency | Completed Requests | 50% (ms) | 90% (ms) | 95% (ms) | 99% (ms) | Max (ms) | Req/Sec | Mean Lat (ms)
64          | 15009              | 70       | 155      | 193      | 327      | 560      | 750     | 80
200         | 14498              | 252      | 476      | 567      | 756      | 1062     | 596     | 330
1000        | 14923              | 1172     | 1954     | 2538     | 3618     | 7641     | 718     | 1350

As you can see the results from using the multi-core approach were much better than the single event loop, and using Cluster is super easy to do.  

You can fork the GitHub project here:  https://github.com/WakeskaterX/NodeThreading

The branch to look at (well, master too, but) is hostedVM, which has the standard single-process deployment of NodeJS and Express; the multi-core approach is on the hostedVM_multicore branch.

Feel free to test it as well and let me know how your results are with better machines than my very low powered VM.

Cheers,
WakeskaterX

Tuesday, January 27, 2015

PlayCrafting Boston Winter Expo - February 24th!

Hey All!  Exciting things have been going on the past few months, like learning new web technologies and taking part in GGJ 2015.  But this post isn't about that, it's about...

The PlayCrafting Boston Winter Expo!



Stop by at 6pm at the Microsoft NERD Center in Cambridge on February 24th, 2015 and check out tons of awesome local indie developers and their games.

You can sign up at the Eventbrite page here.

I'll be there showing off SBX: Invasion and giving away free game related goodies!

Sign up before Feb 23rd to get a discount and come hang out with your local indies!

-Jason C.

Sunday, November 2, 2014

Converting Visitors to Register - adding an Unobtrusive Pop Up

So after about a month of having my e-mail login active on my web page, I have exactly zero sign-ups.  This isn't unexpected, so I'm working on drawing more attention to the login and register buttons at the top.

A lot of websites see great results from using a full-page popup when a new visitor hits the site, but generally when I see these I immediately leave the page, and it's fairly infuriating.

So my goals with the most recent pushes to the website were:

1.  To have a pop up that notifies new users about the registration button, but have it be off to the side, unobtrusive when viewing the website, and easily closed.

2.  To make the registration process a bit more intuitive and flow a little easier.

For the first goal, I created a little speech-bubble pop up that shows only if a user does not have login information saved on their computer.  When it pops up, it only shows for about ten seconds, and it won't show back up for 24 hours.  I might extend this to 48 hours or a week, but I don't think users are visiting my site more than once in that window anyway.
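The logic amounts to something like this (a sketch using localStorage; the key and element names are made up):

    var DAY_MS = 24 * 60 * 60 * 1000;

    function maybeShowRegisterBubble() {
      if (localStorage.getItem('login_info')) return;   // user already registered, skip it

      var lastShown = parseInt(localStorage.getItem('bubble_last_shown'), 10) || 0;
      if (Date.now() - lastShown < DAY_MS) return;      // already shown in the last 24 hours

      localStorage.setItem('bubble_last_shown', String(Date.now()));
      var bubble = document.getElementById('register-bubble');   // hypothetical element id
      bubble.style.display = 'block';

      // hide it again after about ten seconds
      setTimeout(function () { bubble.style.display = 'none'; }, 10000);
    }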

Here's an example of the pop up:



I wanted it to be mostly out of the way, and while it does cover up some buttons on the UI, it has a clear close button that closes the box immediately.

The second thing on my list was to make the registration process easier.  Before, the login and registration buttons were tied together: clicking the single button popped up the login prompt, and from there you had to click the register button, and then register.

These buttons have now been split into separate login and register buttons, so users can choose accordingly and don't have to go through an extra click.

Just a few small UI updates that I've been meaning to do to convert more visitors into registered guests.  I'll post an update later on the results of the slight UI changes.


Cheers,
Jason C.

Sunday, October 5, 2014

Automated E-mail Authentication In! Also Anomaly Updated.

So I got the automated e-mailing system into the website.  When you register, it generates an e-mail to the submitted address with a link to authenticate the token.

Right now the token is indefinite, but I'll add in a time-based token at some point.  I already laid the foundation for the time token in the database and script, but the system is fairly simple.

As you can see here, the activate.php file just takes a token, does some "very basic" checks for HTML and alphanumeric characters, and then simply flips a boolean in the database saying that this e-mail has been authenticated.



It's important to validate e-mails in this way as you want to keep a high sender reputation from your mail server.  Bounced e-mails are no bueno.

Also, Anomaly has been updated to have 10 levels.  The only problem is that it's fairly easy right now because of the rule sets.  I need to add more variety to the rules and include multiple rules on the harder levels to mix things up.  For now this will let me play around with rule sets, see what is fun and what is challenging, and get some feedback as well.

I need to mix it up a bit so that the Anomaly is slightly better hidden on the larger levels.  Right now it becomes fairly easy to locate the Anomaly with the ruleset "Adjacent".


Please check out Anomaly and let me know what you think!  I'm always looking for feedback on this kind of stuff.

Cheers,
Jason C.