Using Raygun to improve Raygun: How we found 263 users experiencing errors

By John-Daniel | Posted Jul 13, 2017 | 5 min. (930 words)

In this post, I wanted to share a recent example of how Raygun can help folks improve their software. Now, normally we’d be writing about one of our many thousands of customers, but today I wanted to write about an issue we solved for ourselves using Raygun Crash Reporting.

Assume nothing, measure everything

One of Raygun’s core values is Assume nothing, measure everything.

This means when we make “improvements,” we A/B test them. This helps us know if it’s a real improvement, or if we just shot ourselves in the foot. We have had pretty bloodied feet, which is why we test things! It’s important to me that our culture is one where the team is never afraid to try things, but they need to be measured and reported on to ensure that the changes are improvements.

A couple of years back, we decided to develop our own A/B test system. The reasons were numerous, but one key reason was that many of the testing products require JavaScript and are therefore blocked by script blockers. Now, you know that script blockers have become popular. What you might not appreciate is just how many software developers use them! It’s way more than the general population! Since visitors running a blocker would never be counted in a JavaScript-based test, our tests would take significantly longer to run.

Further to that, we wanted high performance. Many JavaScript A/B testing tools will try to rewrite the page after they fetch the test data from a third-party server. Snore. Ain’t nobody got time for that! We take performance seriously, given we do sell a Real User Monitoring and APM product.

This led us to run A/B tests on the server side, where they integrate tightly with our public website and app. It means our controllers serve views based on the tests that users opt in to. It also allows us to track goals on the server side easily (e.g., did this result in a user signing up?).
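To make the idea concrete, here is a minimal sketch of server-side variant assignment. It is written in Python purely for illustration and is not our actual implementation; the function names, the test name "pricing-page", and the view file names are all made up for the example. The point is that the server picks the variant before rendering, so nothing depends on JavaScript running in the browser.

```python
import hashlib


def assign_variant(test_name, visitor_id, variants=("control", "treatment")):
    """Deterministically map a visitor to a variant of a test.

    Hashing the test name together with the visitor ID means the same
    visitor always sees the same variant, with no client-side script.
    """
    digest = hashlib.sha256(f"{test_name}:{visitor_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


def choose_view(visitor_id):
    """Hypothetical controller helper: pick the view to serve for a visitor."""
    variant = assign_variant("pricing-page", visitor_id)
    return "pricing_new.html" if variant == "treatment" else "pricing.html"
```

Because the assignment is a pure function of the test name and visitor ID, the same helper can be called again later when recording a goal such as a signup.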

And life was good, until…

RedisException: could not connect to Redis Instance

We run Raygun Crash Reporting on Raygun, and catch errors as we introduce them. So when I noticed an alert from Raygun about a Redis timeout, I went to take a peek.

Jinkies. This error had sneaked up on us. Most importantly, Raygun could tell me that 263 users had encountered this error recently! Understanding user impact is super important, much more important than the actual error count.

Now, for this error, it’s fairly obvious what’s happening. As our popularity has grown, we have started to overwhelm the poor little Redis instance that stores our A/B test tracking data, and it has begun to time out some requests. The fix was easy: increase the resources for that Redis instance and add some defensive code to our A/B test code. Now, it won’t blow up the world if it cannot connect.
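The defensive part can be sketched roughly as follows. This is an illustrative Python example rather than our real code: FlakyStore and MemoryStore are hypothetical stand-ins for a Redis client, and track_goal is an invented name. A real client library raises its own exception types, which you would catch instead of the built-in ones used here.

```python
import logging

logger = logging.getLogger("abtest")


class FlakyStore:
    """Hypothetical stand-in for a Redis client that cannot connect."""

    def incr(self, key):
        raise ConnectionError("could not connect to Redis instance")


class MemoryStore:
    """Hypothetical in-memory stand-in for a healthy store."""

    def __init__(self):
        self.counts = {}

    def incr(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


def track_goal(store, test_name, variant, goal):
    """Record a goal hit; never let a tracking failure break the request.

    Returns True if the hit was recorded and False if it was skipped.
    """
    key = f"abtest:{test_name}:{variant}:{goal}"
    try:
        store.incr(key)
        return True
    except (ConnectionError, TimeoutError) as exc:
        # Losing one data point beats showing the visitor an error page.
        logger.warning("A/B tracking skipped for %s: %s", key, exc)
        return False
```

The trade-off is deliberate: a skipped data point slightly skews the test, but the visitor still gets a working page, and the warning is still logged so the failure stays visible.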

So what?

You, dear reader, are likely a software engineer like myself. My initial thinking is: well, it’s nice to see the data so I can go and fix it. Great that I don’t need to look for a log file. However, consider for a minute that people visit our site from a range of sources. This includes links from other sites, directly typing in the URL, AdWords, etc.

So here’s how we should be thinking about a bug:

Firstly, 263 users having an error is just a pretty poor experience. I don’t want anyone having a poor experience.
If 90% of those visitors are not existing customers, then they surely won’t become a customer if they experience an error. That’s potential lost revenue.
If 10% of those visitors came from a paid advertising source (roughly 26 visitors) where we, perhaps, have paid $5 per click, that would be $130 already wasted on bringing somebody to the site and then dumping them on an error page. That’s real money already wasted, not just hypothetical future revenue (which could be a lot higher!).
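The back-of-the-envelope numbers above can be reproduced with a few lines of Python. The 10% paid share and the $5 cost per click are the illustrative guesses from the list, not measured figures, and the function name is invented for the example.

```python
import math


def wasted_ad_spend(affected_users, paid_share, cost_per_click):
    """Estimate ad spend wasted on visitors who landed on an error page.

    Returns (paid_visitors, wasted_dollars). The visitor count is rounded
    down, since you cannot pay for a fraction of a click.
    """
    paid_visitors = math.floor(affected_users * paid_share)
    return paid_visitors, paid_visitors * cost_per_click


visitors, wasted = wasted_ad_spend(263, 0.10, 5.00)
print(f"{visitors} paid visitors, ${wasted:.2f} wasted")
# prints "26 paid visitors, $130.00 wasted"
```

Swap in your own traffic mix and click costs to see what a single unnoticed error costs you.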

So this is much more than just “ah, it’s nice not to have to go find a log file.” This is about saving money, not creating a poor experience for people, and saving me looking for log files. It’s a triple win!

The tip of the iceberg

So, along with the ability to quickly fix this issue, there is a fourth win here.

Guess how many of those 263 users told us about the error? Zero. Nada. None.

Almost universally, when customers add Raygun, they are blown away by how many errors their application experiences. I can’t count how often I’ve heard “Oh, I don’t think we’d have more than 1,000 errors a month” only to have them run the free trial and discover it’s more than a million errors a month that they didn’t know about!

That’s why we use Raygun Crash Reporting to monitor errors. Otherwise, we wouldn’t have even known this issue was occurring, bleeding money and frustrating prospective customers.

That’s a wrap!

I wanted to share this story because I think it’s important to acknowledge the issues we all face in software development. I love hearing stories from our customers about how we have helped them improve their software and wanted to share one of our own.

If you’ve not yet tried Raygun, you can try a free trial and track errors and performance across your whole software stack. Read more here, and hopefully, you won’t have a Redis Exception 🙂