Error monitoring and exception handling in large-scale software projects

By Joel Hans | Posted Feb 16, 2022 | 8 min. (1682 words)

Large-scale software projects don’t care how many unit tests you put into your code. Or how sophisticated your CI/CD pipeline is. Or how robustly you run blue-green deployments to ease into newly-deployed code. These projects will inevitably find themselves subjected to your users, who will uncover bugs your team didn’t catch and didn’t even think to test for.

Large-scale software is simply too complex to be bug-free, and you can’t test for all of the different ways your users will interact with your application. That’s not to minimize the importance of robust testing suites — they’re invaluable for catching major bugs in mission-critical applications or infrastructure — but rather that your software development lifecycle should also take error monitoring into account.

By proactively preventing bugs with tests and building a full-view system to monitor the errors and exceptions you hadn’t expected or that weren’t caught in testing, you maintain a healthy development cycle for your team. That inevitably means a healthier application for your end users.

Why just testing isn’t enough

Even if your team dogfoods your application, watches countless Hotjar recordings of real-user interactions, and thinks they’ve tested everything, you’re still restricted by your own biases about how the application works and by how the product team has defined their user stories.

The reality of large-scale codebases is different. When thousands of users are simultaneously exploring your application, they find ways past all of the guardrails your team might have set up, whether that’s with unexpected interactions or by using devices you hadn’t planned for. You might even discover that a staging server, tested on by only your development team, behaves very differently compared to your production infrastructure under heavy load or latency.

How to handle application errors the right way

The wrong way to handle application errors is to offload the reporting to users. You’ve probably seen those Send Error Report popups when one of your applications crashes. Do you ever actually click it and opt in to sending (potentially sensitive) data to the developer? Have you ever heard back from them about an investigation they ran and a fix they deployed to make your life easier? Probably not. There’s no confidence or reward in submitting error reports, and Raygun’s own research finds that only 1% of users actually report the errors they’ve experienced.

Instead, you need to take a proactive approach to identify previously-unknown errors and exceptions without requiring direct feedback from your busy users.

We recommend starting by architecting your software for better error reporting. Establish naming conventions aligned with business logic or related systems, never a single developer’s whims, so that anyone looking at an error monitoring platform can quickly understand where an issue is coming from and what kind of user it’s affecting. In addition, you should always include version information, identify the feature or unit test associated with the error, and create a standard for exception handling that’s clearly communicated across your organization.
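To make that concrete, here is a minimal sketch in Python of what a standardized error report could look like. The `build_error_report` helper, the `APP_VERSION` constant, and the field names are all hypothetical; your own conventions will differ, and an error monitoring platform’s SDK will have its own payload format.

```python
import json
import traceback
from datetime import datetime, timezone

APP_VERSION = "2.4.1"  # hypothetical version string, normally injected at build time


def build_error_report(exc, *, component, feature, user_tier):
    """Return a JSON-serializable dict describing an exception.

    The name is prefixed with the business component so the origin of the
    error is readable at a glance in a dashboard.
    """
    return {
        "name": f"{component}.{type(exc).__name__}",
        "message": str(exc),
        "version": APP_VERSION,
        "feature": feature,
        "user_tier": user_tier,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "stack": traceback.format_exception(type(exc), exc, exc.__traceback__),
    }


try:
    raise ValueError("invoice total is negative")
except ValueError as exc:
    report = build_error_report(
        exc,
        component="billing.invoices",
        feature="monthly-invoice",
        user_tier="enterprise",
    )
    print(json.dumps(report, indent=2))
```

The point is less the exact fields than the fact that every report carries the same ones, so anyone reading it can tell which part of the business it came from and which release introduced it.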

Errors and exceptions: What’s the difference?

We’re talking a lot about both errors and exceptions, so it makes sense to spend some time defining what each is. For the sake of clarity and adaptability to as many development teams as possible, we’ll stick to language-agnostic definitions.

Errors

An error is a serious problem that an application doesn’t usually get past without incident. Errors cause an application to crash, and ideally send an error message offering some suggestions to resolve the problem and return to a normal operating state, like asking users to restart the application, refresh their browser tab, or log out and back in again.

There’s no way to deal with errors “live” or in production — the only solution is to detect them via error monitoring and bug tracking and dispatch a developer or two to sort out the code.

Exceptions

Exceptions, on the other hand, are exceptional conditions an application should reasonably be expected to handle. Programming languages allow developers to include try…catch statements to handle exceptions and apply a sequence of logic to deal with the situation instead of crashing.

And when an application encounters an exception that there’s no workaround for, that’s called an unhandled exception, which is the same thing as an error.
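The definitions above are language-agnostic, but a short Python sketch may help illustrate the distinction (Python spells try…catch as `try`/`except`). The function names here are made up for the example.

```python
def parse_quantity(raw):
    """Handled: fall back to a default when the input is not an integer."""
    try:
        return int(raw)
    except ValueError:
        # The exceptional condition is expected, so the application recovers.
        return 1


def parse_quantity_unhandled(raw):
    """Unhandled: a ValueError raised here propagates up to the caller."""
    return int(raw)


print(parse_quantity("3"))      # 3
print(parse_quantity("three"))  # 1

# If nothing up the call stack catches the exception, the program stops.
# Here the last line of defense is explicit so the script keeps running.
try:
    parse_quantity_unhandled("three")
except ValueError as exc:
    print(f"would have crashed: {exc}")
```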

Why error monitoring is so important for large-scale software projects

Many developers think that the path for bug-free software is simply catching every exception. There are a few flaws in this way of thinking, especially when it comes to large-scale codebases and complex applications.

First, there is no reasonable way for development and product teams to dream up every possible exception in your application and develop lifesaving logic in the code. Even if you could, the development workload would be phenomenally taxing and would distract developers from higher-value work.

Second, exceptions can mask real issues in your code by applying secondary or tertiary logic instead of the function you meant to run in most cases. If you catch every exception and do nothing with that error, you lose that information and can never resolve the underlying bug. Error tracking will give you clear oversight of the root cause.
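A small Python sketch of that second point, using hypothetical function names and a plain dictionary standing in for a real pricing service. Both versions return the same fallback value, but only one leaves a trace you can act on.

```python
import logging

logger = logging.getLogger("orders")


def load_discount_swallowed(prices, sku):
    """Catches everything and says nothing: the missing SKU is never noticed."""
    try:
        return prices[sku]
    except Exception:
        return 0.0


def load_discount_reported(prices, sku):
    """Catches only the expected failure and records it with a stack trace."""
    try:
        return prices[sku]
    except KeyError:
        # logger.exception logs at ERROR level and attaches the traceback.
        logger.exception("discount lookup failed for sku=%s", sku)
        return 0.0


print(load_discount_swallowed({}, "A1"))  # 0.0, and no record of why
print(load_discount_reported({}, "A1"))   # 0.0, plus a logged error
```

In the second version the user still gets a working page, and the log entry is what an error tracking tool would pick up to show you the root cause.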

How to code the application to recover by itself

Throwing and catching exceptions is a clever way to let an application recover from unexpected situations by itself, prevent error states, and provide more meaningful feedback for your end-users.

For example, let’s say the user of your SaaS application is trying to invite a colleague to their organization’s account. When they input an email address into the invitation page, they accidentally leave out the @ symbol, which makes it an invalid address.

The user has just created an exception. If there’s no logic to handle it, that’s an unhandled exception (an error) that either puts the application into an unrecoverable situation or continuously asks your backend mail server to send emails to a nonsensical address. Either way, the user isn’t getting what they wanted, and the application doesn’t give them any useful information about what happened. You might be completely unaware that either situation has happened.

But instead of creating errors, you can add exception handling to this invitation page to deal with invalid emails. You can test the email string for certain conditions (the try) and respond with helpful messages about what might be going wrong (the catch). Your exception handling would recognize the missing @ symbol and gently remind the user, via the form’s UI, that they need to make a change.
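The invitation form described above could be sketched like this in Python. `InvalidEmailError`, `validate_email`, and `handle_invite` are hypothetical names, and the validation is deliberately simplistic; real email validation has many more cases than a missing @ symbol.

```python
class InvalidEmailError(ValueError):
    """Raised when an invitation address cannot be a valid email."""


def validate_email(address):
    """Return the cleaned address, or raise InvalidEmailError with a hint."""
    address = address.strip()
    if "@" not in address:
        raise InvalidEmailError("That email address is missing an @ symbol.")
    local, _, domain = address.partition("@")
    if not local or not domain:
        raise InvalidEmailError("That email address looks incomplete.")
    return address


def handle_invite(address):
    """Return a response the form UI can show to the user."""
    try:
        email = validate_email(address)
    except InvalidEmailError as exc:
        # Recover: nothing is sent, and the user gets a helpful message.
        return {"ok": False, "message": str(exc)}
    # A real application would hand the address to its mail service here.
    return {"ok": True, "message": f"Invitation sent to {email}."}


print(handle_invite("colleague.example.com"))
print(handle_invite("colleague@example.com"))
```

The mail server never sees the invalid address, and the user learns what to fix without leaving the form.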

Why is it important to specify which type of exception to catch?

When building out exception handling, you should always create standards and ensure everyone on your team is on board. How you build these standards is based on your programming language of choice and what you’re trying to help users accomplish.

In the end, the goal isn’t to eliminate every potential error using phenomenally complicated chains of exception handling. Instead, try to deal with the easy issues and make sure you know about the complex ones.

One reason you need standards for exception handling, and don’t want to needlessly pursue them all, is that certain exceptions can corrupt data and leave your application in an even worse error state. This example of an invitation form might not be a worst-case scenario, but the logic tracks — if your application sends the invalid email to your database, and there’s no way for the user to delete or edit a pending invitation, they’ll be forced to seek help from your customer support team.

As your team works on error/exception handling, you should carefully consider any logic that writes to your database or dramatically alters the user interface to put up friendly guardrails for your users.
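One way to picture both ideas, catching a specific exception type and guarding database writes, is the Python sketch below. It uses the standard library’s `sqlite3` module with an in-memory database; the `invitations` table, the `open_db` and `create_invitation` helpers, and the simple CHECK constraint are assumptions made for the example.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a guarded invitations table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invitations ("
        " email TEXT NOT NULL UNIQUE CHECK (email LIKE '%_@_%')"
        ")"
    )
    return conn


def create_invitation(conn, email):
    """Insert an invitation; return False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO invitations (email) VALUES (?)", (email,))
    except sqlite3.IntegrityError:
        # Only constraint violations are handled here. Anything else
        # (a locked database, a typo in the SQL) propagates, so it shows
        # up in error monitoring instead of being silently absorbed.
        return False
    return True


conn = open_db()
print(create_invitation(conn, "colleague@example.com"))  # True
print(create_invitation(conn, "colleague.example.com"))  # False
```

Catching `sqlite3.IntegrityError` rather than a bare `Exception` keeps the handler narrow: the cases you understand are dealt with, and the ones you don’t are still visible.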

A solution to handle unhandled exceptions with error monitoring

The first thing you need to do is log unhandled exceptions using your well-architected standards, which will include important contextual information to help with diagnosing the bug in your code. For smaller deployments and codebases, this level of error reporting might be enough to catch mission-critical errors.
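As a rough illustration in Python, the standard library’s `sys.excepthook` is called for any exception that reaches the top of the main thread uncaught, which makes it one place to attach logging. The logger name, the `APP_VERSION` constant, and the `app_version` field are hypothetical; a monitoring platform’s SDK would typically install its own hook instead.

```python
import logging
import sys

APP_VERSION = "2.4.1"  # hypothetical version string
logger = logging.getLogger("app.unhandled")


def log_unhandled(exc_type, exc_value, exc_tb):
    """Log any uncaught exception along with contextual information."""
    if issubclass(exc_type, KeyboardInterrupt):
        # Let Ctrl+C behave as usual instead of logging it as a crash.
        sys.__excepthook__(exc_type, exc_value, exc_tb)
        return
    logger.critical(
        "unhandled exception",
        exc_info=(exc_type, exc_value, exc_tb),
        extra={"app_version": APP_VERSION},
    )


# Replace the default hook, which only prints the traceback to stderr.
sys.excepthook = log_unhandled
```

Note that this hook covers the main thread only; exceptions in other threads go through `threading.excepthook` (Python 3.8 and later).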

But for large-scale software projects, you need a place to view logged errors and exceptions to make sense of how often they happen. The best solution is an error monitoring platform, like Raygun’s Crash Reporting, which centralizes all your exceptions into a single dashboard for close observation and reporting on trends over time. For example, Raygun now lets you independently track handled exceptions, unhandled exceptions, and errors that bring your application to a halt.

By pairing these resources together, and thanks to that robust chain of contextual information, you can start to prioritize what to fix and when. Here are some guidelines:

Do: Prioritize errors/exceptions found on production servers over those found on staging.
Don’t: Focus on trivial errors, even if they create many exception logs, just for the sake of cleaning things up.
Do: Work on exceptions that directly impact the user experience.
Don’t: Build KPIs around reducing errors or exceptions by a certain percentage or targeting a specific number.
Do: Prioritize errors related to your users’ personally identifiable information, billing cycle, and functions that could corrupt your database.
Don’t: Ignore an error simply because a user hasn’t mentioned it to your customer service team, dropped a screenshot in your forum, or sent a frustrated Tweet about it. Remember that only 1% of users report their errors!
Do: Look at your error monitoring platform for a sudden jump in specific errors/exceptions, or even types, which could signal an issue with a recent deployment.
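The last guideline, watching for a sudden jump after a deployment, can be sketched as a simple comparison between two time windows. This is only an illustration of the idea, not how any particular platform detects spikes; `find_spikes` and its `factor` and `min_count` thresholds are hypothetical.

```python
from collections import Counter


def find_spikes(baseline, current, *, factor=3.0, min_count=10):
    """Return error types that jumped between two equal-length windows.

    baseline and current are lists of error-type names, one entry per
    occurrence. A type counts as a spike if it occurred at least
    min_count times and more than factor times as often as before.
    """
    before = Counter(baseline)
    now = Counter(current)
    spikes = []
    for name, count in now.items():
        if count >= min_count and count > factor * before[name]:
            spikes.append(name)
    return sorted(spikes)


# Hypothetical counts for the hour before and the hour after a deployment.
before_deploy = ["TimeoutError"] * 5 + ["KeyError"] * 20
after_deploy = ["TimeoutError"] * 40 + ["KeyError"] * 22 + ["ValueError"] * 2

print(find_spikes(before_deploy, after_deploy))  # ['TimeoutError']
```

The `min_count` floor is there for the same reason as the guideline about trivial errors: a type that went from zero to two occurrences is technically a jump, but rarely the thing to drop everything for.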

Now that you have full visibility into errors and exceptions, you can finally start to see how your users are actually spending time in your application. No trying to write unit tests for every edge case, no recording every single user session to understand their quirks, and no chasing the impossible dream of a 100% bug-free application.

Errors and exceptions are inevitable, so why not use them to improve your codebase? Instead of wasting time on valueless bug-hunts and worrying over every possible try-catch block, you’re turning errors into opportunities to deliver a better user experience.

Claim a free 14-day trial of Raygun Error Monitoring and Crash Reporting.