Dashboard down?

steve

Posted on
Jul 30 2014

As of 9:30 AM EST, access to app.raygun.io/dashboard/ has been extremely slow and/or unavailable. I have tried from multiple geographical locations and get the same result.


Alex Webber

Posted on
Jul 30 2014

Same here, it's dead for me too.


Jeremy Steele

Posted on
Jul 30 2014

Ditto. Been down about half an hour now. Hopefully errors are still getting posted...


Alex Webber

Posted on
Jul 30 2014

You should be safe there, Jeremy, although these guys would benefit from a basic status page.


steve

Posted on
Jul 30 2014

Not sure what the SLA is for Raygun, but an hour of downtime is pretty much unacceptable.


Alex Webber

Posted on
Jul 30 2014

The SLA only applies if you've got an enterprise plan, from what I can see. We run an internal ELMAH instance as a backup should the dashboard go down, which it has done before. They've never lost any of our errors, though.


KennyBu

Posted on
Jul 30 2014

Same for me, any updates Raygun peeps?


Jeremy Steele

Posted on
Jul 30 2014

Well, mine just loaded for a second, and now it's back to not loading. Lame :(


Alex Webber

Posted on
Jul 30 2014

Same. An incident update of any description from Mindscape would be nice...


Stefano Verna

Posted on
Jul 30 2014

The same situation here.


Jeremy Steele

Posted on
Jul 30 2014

Seems to be loading now (noon EST).


steve

Posted on
Jul 30 2014

Incredibly slow now (2:20 EST).


John-Daniel Trask

Raygun

Posted on
Jul 30 2014

Hi Everyone,

I'm really sorry about the outage of the app dashboards there - Jeremy from our team was working on it. One of the data nodes in our cluster (for some more detail, an Elasticsearch node) effectively locked up and was very slow coming back online. While that node was locked up or recovering, the application would either time out or load very slowly. In theory there is enough redundancy in that cluster that the other nodes should pick up the slack and carry on, but they became overwhelmed, and hence pages were either loading very slowly or timing out.

No data coming in was lost.

We've designed Raygun so that there is very high resiliency in the inbound data APIs. Obviously we need to do better with the resiliency of the app itself - there have been a few unacceptable slowdowns when Elasticsearch has had issues. We'll be meeting today to discuss how we can improve that. We'll also look at creating a status board to communicate more clearly.

Again, I'm very sorry about the issues that have occurred here - it's never a good look and the last thing we want to do is cause you problems.

John-Daniel Trask
Co-founder
Mindscape Limited

P.S. If you'd like to discuss any specifics to you, don't hesitate to email me direct: jdtrask@raygun.io


Alex Webber

Posted on
Jul 31 2014

Down again?


Stefano Verna

Posted on
Jul 31 2014

Yep... and today we're having the public launch of our app... :/


Jeremy Steele

Posted on
Jul 31 2014

Yeah. No good guys. Going down mid-day eastern is pretty lame.


John-Daniel Trask

Raygun

Posted on
Jul 31 2014

Hi Alex, Jeremy & Stefano,

Yes, unfortunately we did have two outages - not to the extent of the issues on the previous day, and impacting far fewer accounts. Absolutely, it was lame, and this is turning into a perfect storm of a week of problems with our Elasticsearch cluster.

What actually happened

Our data cluster, which runs on Elasticsearch, had two "split-brain" situations today, causing some customers' data not to load properly when requested.

If you've not used ES you can skip this paragraph, but I'm providing it to give more detail on what happened. In both outages today, one of our nodes "split-brained": it left the cluster and then decided it was a master node. This is why today's outages impacted only a subset of our users. It happened twice through the day. We typically see this when our hosting provider has internal network issues - specifically, latency between nodes high enough that a node thinks it's alone and promotes itself to master (when in reality it shouldn't).

Jeremy on our team manually rejoined the node to the cluster both times this happened. When you do this with ES, it runs through a process of verifying the new node, syncing in any changes, rebalancing, etc. It can take a little time before the node is fully available again.
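
The rejoin-and-recover cycle described above can be watched by polling Elasticsearch's cluster health endpoint (`GET /_cluster/health`) until it reports green. Here's a rough sketch of that wait loop - `fetch_status` is a stand-in you'd implement with a real HTTP call to the health API; the function name and defaults are illustrative, not any actual tooling:

```python
import time
from typing import Callable


def wait_for_green(fetch_status: Callable[[], str],
                   attempts: int = 30,
                   delay: float = 0.0) -> bool:
    """Poll cluster health until it reports 'green'.

    fetch_status should return the 'status' field from
    Elasticsearch's GET /_cluster/health response ('red',
    'yellow' or 'green'). Returns True once the cluster is
    green, or False if we give up after `attempts` polls.
    """
    for _ in range(attempts):
        if fetch_status() == "green":
            return True
        time.sleep(delay)
    return False


# Simulated recovery: red while the node syncs, then yellow, then green.
statuses = iter(["red", "yellow", "green"])
print(wait_for_green(lambda: next(statuses)))  # -> True
```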

This, of course, doesn't change the fact that we did have an outage. I just wanted to share what happened, and how we're working to ensure this doesn't happen again.

What we're doing

The outcome of our meeting yesterday was that we will migrate our Elasticsearch cluster to a new set of servers with Amazon (currently they're with another provider that we'd picked for faster IO, but now that Amazon has a good SSD story, it makes sense to bring the cluster back inside AWS). This will take a bit of time simply due to the volume of data we have.

Issues like the ES "split-brain" usually have a fairly simple fix, so we're looking at automating the resolution when it occurs. That should reduce the time to recover if/when it happens again.

We're also looking at how the cluster is configured to see whether we can ensure that one rogue node won't affect any customers (we already have redundancy in the data, which is replicated across several nodes).
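
For context, the standard guard against split-brain in Elasticsearch of this era (1.x) is the `discovery.zen.minimum_master_nodes` setting: require a strict majority (quorum) of master-eligible nodes before any node can become master, so an isolated node can never elect itself. A minimal sketch of the quorum arithmetic - the function is illustrative, not part of any ES API:

```python
def minimum_master_nodes(master_eligible_nodes: int) -> int:
    """Quorum size: a strict majority of master-eligible nodes.

    With minimum_master_nodes set to this value, a partitioned
    minority can never elect its own master, which prevents the
    split-brain scenario described above.
    """
    return master_eligible_nodes // 2 + 1


# e.g. for a 3-node cluster, set in elasticsearch.yml:
#   discovery.zen.minimum_master_nodes: 2
print(minimum_master_nodes(3))  # -> 2
```

Note that with only 2 master-eligible nodes the quorum is 2, meaning the cluster can't elect a master if either node is partitioned - which is why odd-sized clusters (3, 5, ...) are the usual recommendation.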

At present, when Elasticsearch has issues, the dashboard simply fails to load. Along with a status page, I'll have one of our team improve the dashboard to state explicitly when there's a communication issue with our data cluster (so you're not left wondering if it's something with your internet connection).
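
The dashboard-side improvement described here amounts to distinguishing "our data cluster is unreachable" from a generic page timeout. A minimal sketch of that decision - the function, states, and banner messages are all hypothetical, not Raygun's actual code:

```python
def dashboard_banner(cluster_reachable: bool, cluster_healthy: bool) -> str:
    """Map data-cluster state to a user-facing banner message, so a
    cluster problem is reported explicitly instead of surfacing as
    a silent timeout the user might blame on their own connection."""
    if not cluster_reachable:
        return "We're having trouble reaching our data cluster - your data is safe."
    if not cluster_healthy:
        return "Our data cluster is degraded - dashboards may load slowly."
    return "ok"


print(dashboard_banner(cluster_reachable=False, cluster_healthy=False))
```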

There is a fair bit of work there, and we've started on it already.

I'm really sorry about this outage; it's been a terrible week for issues. Our whole team takes reliability & quality seriously - it's why we built Raygun in the first place - so we're very sorry for letting you down. We're working as fast as we can to make sure it doesn't happen again, and we have members of the team available around the clock until we've implemented these improvements.

As always, this did not result in any data loss - only querying for the dashboard stopped working.

John-Daniel Trask
Co-founder & CEO
Mindscape Limited


Jeremy Steele

Posted on
Jul 31 2014

Thank you for the detailed explanation! Hope you guys can get all this sorted.
