Scraping Websites with X-ray, AWS Lambda and Serverless

For my BrickCompare project, I decided to build a microservice to scrape websites for the pricing data I needed. AWS Lambda from Amazon Web Services was a perfect fit for the task.

Scraping Websites with X-ray

I had already decided to use the Node.js platform to run my microservice, as I was familiar with it. So I then had to select a module that could get me started on website scraping. Initially I selected noodlejs, as it looked easy to use and had decent documentation. But after writing about 10 scrapers for different websites, I found that it was rather buggy and did not return consistent results.

With noodlejs having seen no development in almost 2 years, I decided to find a new module for my needs. I eventually settled on the x-ray web scraper. It was much easier to use, more consistent and had a much nicer API. I would thoroughly recommend this library for scraping websites.
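To give a feel for the x-ray API, here is a minimal sketch of what one scraper might look like. It is not the actual BrickCompare code: the CSS selectors, the field names and the parsePrice helper are assumptions made up for illustration, and it assumes x-ray has been installed from npm.

```javascript
// Sketch only: selectors, field names and helper names are hypothetical.

// Turn a scraped price string such as "$1,299.99" into a number,
// or null when the text contains no digits.
function parsePrice(text) {
  const match = String(text).replace(/,/g, '').match(/\d+(\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// Scrape every ".product" element on a page into { name, price } objects.
// x-ray is loaded lazily, so it is only required when a scrape actually runs.
function scrapeProducts(url, callback) {
  const Xray = require('x-ray'); // npm install x-ray
  const x = Xray();
  // x(url, scope, selector) returns a function that takes a Node-style callback.
  x(url, '.product', [{ name: '.name', price: '.price' }])((err, items) => {
    if (err) return callback(err);
    callback(null, items.map((item) => ({
      name: (item.name || '').trim(),
      price: parsePrice(item.price),
    })));
  });
}
```

Calling scrapeProducts with a page URL and a callback would hand back an array of products, assuming the page really uses those class names.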

However, x-ray was only good for scraping statically loaded web pages. It did not execute a page's JavaScript, so websites that loaded their content dynamically with AJAX calls could not be scraped this way. After some more research, I plan to experiment with Nick.js, as it runs headless Chrome to render and scrape websites. The only blocker for me right now is that it requires Node.js 8, but AWS Lambda only supports a 6.10 runtime.

AWS Lambda Deployed with Serverless

The AWS Lambda service provided a way to run “functions”: I could upload some code and the service would run it when triggered. In this case I could upload Node.js JavaScript code and it would run without my setting up any servers. Hence the “serverless” architecture.

The Serverless framework provided a command line interface for me to easily deploy my website scrapers to AWS Lambda. All the configuration was specified in a YAML file and Serverless handled the rest through the AWS API. I could even invoke the remote functions through the Serverless command line.

For BrickCompare, I configured the scrapers to trigger every three hours, and Serverless handled the AWS configuration for me. To be honest, it was like magic; it was that easy to use. Whether Serverless configured it in the most efficient way I am not so sure, as I did run into some limits as the number of my website scrapers increased.
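For reference, a minimal serverless.yml along these lines might look like the fragment below. This is a sketch and not my actual configuration: the service name and function name are made up, and it assumes a handler.js file that exports a function called scrape.

```yaml
# Sketch only: service, function and file names are hypothetical.
service: brickcompare-scrapers

provider:
  name: aws
  runtime: nodejs6.10

functions:
  scrape:
    handler: handler.scrape      # file handler.js, exported function "scrape"
    events:
      - schedule: rate(3 hours)  # trigger the function every three hours
```

With a file like this in place, running serverless deploy pushes the function to AWS, and serverless invoke -f scrape runs the deployed function from the command line.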

Microservices are the Future

I found that AWS Lambda with Serverless was a great way to run my scrapers, and I could easily scrape a few website pages for the data I needed in a matter of seconds. And the best part is that AWS is pretty generous with the Lambda free tier, and I am still within those limits. So everything was done at zero cost!
