Incident postmortem: Web server issue
On Thursday 11th January 2018, starting at 9 am EST, some of our customers may have seen intermittent 500 error pages while viewing apps and crash reports.
I’d like to let you know that we have identified the cause of the problem and that the issue was resolved by 1 pm EST. We are continuing to investigate why it happened and will update this blog post as soon as we know more.
What you need to know
Please rest assured that this issue caused no loss of data. All data coming into Raygun via the API, and the backend processing of that data, was unaffected. We understand the issue caused loss of access to some key areas of the Raygun application, and we apologize sincerely for any inconvenience caused during this time.
- We had an issue with one of our web servers serving requests to the Raygun web application. Because requests are load balanced across servers, only some requests failed with a 500 error. Users who encountered this error were shown the error page within the app.
- The root of the error was an IIS application pool in a bad state. This affected a third-party library that is used to connect to one of our backend databases (Redis). As a result, any request to the database would return the data, but the third-party library would throw a low-level error while trying to read the returned data stream.
- The issue was resolved when the application pool was recycled. Recycling cleared the bad state, and the new application pool started functioning correctly. All requests to the Raygun application after this time completed successfully.
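To illustrate why a single bad server produces intermittent rather than constant errors, here is a minimal sketch. It assumes simple round-robin balancing across four servers, one of which is in a bad state; the server count, the balancing strategy, and the function name are all hypothetical and are not a description of Raygun's actual setup. It also simplifies by failing every request that reaches the bad server, whereas in the incident only requests that used the affected library failed.

```python
from itertools import cycle


def simulate_round_robin(server_count, bad_servers, request_count):
    """Return the HTTP status each request would see under round-robin balancing."""
    servers = cycle(range(server_count))  # hypothetical: requests rotate across servers
    statuses = []
    for _ in range(request_count):
        server = next(servers)
        # Simplification: every request routed to a bad server fails with a 500.
        statuses.append(500 if server in bad_servers else 200)
    return statuses


statuses = simulate_round_robin(server_count=4, bad_servers={2}, request_count=12)
failure_rate = statuses.count(500) / len(statuses)
print(statuses)
print(f"failure rate: {failure_rate:.0%}")
```

Under these assumptions one request in four fails, which matches the experience of a page erroring on one load and working on the next.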
How did it get into this bad state?
We don’t know yet. We are continuing to investigate how this happened, and we’re reviewing our logs and error reports to get to the bottom of it. We’ll update this post once we’ve resolved the underlying issue.
What was affected?
Only requests within the Raygun web application. All data coming into Raygun via the API and all backend processing of that data was unaffected. There was no data loss as a result of the errors encountered by users of the web application.
Why did it take so long to resolve? Don’t you have monitoring in place to detect this?
While we do have a very comprehensive set of monitors in place, and we monitor all parts of the Raygun platform via multiple mechanisms, our existing monitoring configuration did not pick up this issue. The issue was limited to one machine and to a specific third-party library that is not used on all pages of the web application, so the pages we were monitoring kept loading successfully and no monitors tripped. The monitoring of the backend database also reported no problems, because the database itself was healthy.
Once we finish our analysis of the error and of why our monitoring failed to pick it up, we will implement new measures to ensure we are alerted to situations like this in the future.
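As an illustration of the monitoring gap described above, the following sketch contrasts a shallow check (does a page load?) with a deeper per-server check that also reads through the database client library. It is a sketch under stated assumptions, not our monitoring implementation: the `Server` type, the probe functions, and the server names are all hypothetical, and the probes are stubbed rather than making real network calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Server:
    name: str
    load_static_page: Callable[[], bool]  # shallow probe: a page with no database access
    read_from_cache: Callable[[], bool]   # deep probe: a read through the client library


def passes(probe: Callable[[], bool]) -> bool:
    """Treat any exception raised by a probe as a failed check."""
    try:
        return bool(probe())
    except Exception:
        return False


def shallow_failures(servers: List[Server]) -> List[str]:
    """Names of servers whose monitored page does not load."""
    return [s.name for s in servers if not passes(s.load_static_page)]


def deep_failures(servers: List[Server]) -> List[str]:
    """Names of servers that fail either the page probe or the library read."""
    return [
        s.name
        for s in servers
        if not (passes(s.load_static_page) and passes(s.read_from_cache))
    ]


def broken_read() -> bool:
    # Stands in for the low-level error thrown while reading the returned data stream.
    raise IOError("could not read response stream")


# Hypothetical fleet: web-2 serves pages fine but cannot read through the library.
servers = [
    Server("web-1", lambda: True, lambda: True),
    Server("web-2", lambda: True, broken_read),
    Server("web-3", lambda: True, lambda: True),
]

print(shallow_failures(servers))
print(deep_failures(servers))
```

In this sketch the shallow check reports nothing wrong, while the deep check singles out the one server whose library reads fail. Running the deep check against each server individually, rather than through the load balancer, is what makes a single bad machine visible.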
We appreciate your patience and help while we resolved this issue, and we are sincerely sorry for any inconvenience caused.
If you have any further questions or concerns, please reach out to our support team here.
John-Daniel Trask, CEO and Co-Founder