The majority of Raygun runs on a .NET-based technology stack, on a mix of Windows and Ubuntu boxes (using Mono). But we are always open to new technologies and like to use whichever one best suits the problem at hand. We recently ran into some scaling issues with the Raygun API that receives your errors. This post highlights a few reasons why we switched this particular area of Raygun over to Node.js.
We gained a few new customers who were pumping significant error data our way. Fortunately we were able to spin up new machines and handle this new load. Unfortunately all these new machines added up to a noticeable increase in our hosting bill. We needed to be able to handle this new load with a lot less hardware in order to keep Raygun cost-efficient for our customers (that’s you, hopefully!).
Backing up a little to provide more context, this story goes back to about August 2013. We’d been processing several million data points per day, and one evening the data volume jumped to over 100,000 crash reports per second! Mono is a cool bit of tech, but it’s not so great for handling that type of web load. We needed to find a way of handling this volume. While we looked into a solution, we simply scaled out the API nodes to handle the volume: 12 EC2 instances all up.
The Raygun API has quite a simple job. It needs to receive HTTP requests, mainly consisting of a JSON body. It then needs to do some authorisation to ensure the provided API key is valid, run some small checks to ensure that it is dealing with valid Raygun error data, and finally pop this data onto a queue. These are all tasks that are ideally suited to Node.js and its event-loop-driven architecture.
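To make that pipeline concrete, here is a minimal sketch of the receive, authorise, validate and enqueue steps using only Node.js's built-in http module. It is an illustration, not our production code: the `x-apikey` header name, the `occurredOn` and `details` payload fields, the `demo-key` value, the status codes and the in-memory array standing in for the queue are all assumptions made for the example.

```javascript
'use strict';
const http = require('http');

// Assumption: a hard-coded key set; a real service would check a datastore or cache.
const VALID_API_KEYS = new Set(['demo-key']);

// Assumption: an in-memory array standing in for a real message queue.
const queue = [];

// Small sanity check on the payload shape (field names are hypothetical).
function isValidPayload(payload) {
  return payload !== null &&
    typeof payload === 'object' &&
    !Array.isArray(payload) &&
    typeof payload.occurredOn === 'string' &&
    payload.details !== null &&
    typeof payload.details === 'object';
}

// Authorise, validate and enqueue one request; returns the HTTP status to send.
function processRequest(apiKey, rawBody) {
  if (!VALID_API_KEYS.has(apiKey)) {
    return 403; // unknown or missing API key
  }
  let payload;
  try {
    payload = JSON.parse(rawBody);
  } catch (err) {
    return 400; // body is not valid JSON
  }
  if (!isValidPayload(payload)) {
    return 400; // JSON, but not recognisable error data
  }
  queue.push({ apiKey, payload });
  return 202; // accepted for later processing
}

// Build the HTTP server. Body chunks arrive as events, so the event loop
// stays free to serve other connections while a request is still uploading.
function createApiServer() {
  return http.createServer((req, res) => {
    if (req.method !== 'POST') {
      res.writeHead(405);
      res.end();
      return;
    }
    const chunks = [];
    req.on('data', (chunk) => chunks.push(chunk));
    req.on('end', () => {
      const body = Buffer.concat(chunks).toString('utf8');
      const status = processRequest(req.headers['x-apikey'], body);
      res.writeHead(status);
      res.end();
    });
  });
}
```

Calling `createApiServer().listen(3000)` would start the sketch on port 3000. Keeping `processRequest` separate from the HTTP plumbing means the authorisation and validation logic can be exercised without opening a socket.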
We started with a clean slate and so we didn’t have any of our existing code to leverage while building this out. Fortunately one of the best parts of developing with Node.js is being able to lean on NPM and its large pool of existing modules for Node.js. Some of the modules that we leverage are:
After building out the replacement API and performing plenty of load tests with siege, we were happy that we had significantly increased the load we could handle per machine. The next step was to cycle the new API into our pool of API nodes so that we could run both side by side. This confirmed our testing results, and we rolled out the changes across all of our API nodes. We were then able to significantly reduce the number of machines required to serve all of your errors.
Ultimately, it took us only a couple of days to get the API nodes rebuilt using Node.js. We reduced the API box count down to 3 (we wanted to maintain redundancy), with 60% free capacity, even at the extreme volume of events being received. The nice thing about this is that due to the efficiency gains, we know that our architecture can scale remarkably well with our customer growth. Performance is an asset to our business and something that we continually invest in.
Node.js is a worthy addition to your toolkit. It’s not a magic hammer to solve all your problems, though. If your workload is CPU-intensive and can’t leverage the event loop, then it will not be the right tool for the job. In this use case it fit in well and allowed us to serve more users with a lot less hardware.