This morning at around 3am UTC, one of our cloud servers had an issue, which caused the frontend of our website to go down. During this time nobody was able to access any Stacker portal.
At around 7:30am UTC, our engineering team noticed the problem, and deployed a fix by around 8:00am UTC.
I want to apologise for this downtime. We take the uptime of our product really seriously, so this represents a major incident for us. Over the next week we’ll be putting in measures to prevent this from happening in the future.
At around 2:45am UTC, one of Google Cloud’s node scheduler jobs removed one of the nodes from our production cluster. We’re still investigating to work out whether this was due to an error or a scheduled node upgrade.
We run on a multi-node cluster, so normally this wouldn’t have caused an incident, as all the pods would have been evicted from that node and rescheduled onto a different one. Unfortunately, the part of our frontend infrastructure that issues and stores SSL certificates requires all of its pods to run on a single node, which we designate using a node label.
This meant that even though a new node was provisioned, there were no nodes that met the frontend pods' selection criteria, so they were not able to be scheduled.
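To illustrate why the pods could not be scheduled, here is a minimal sketch of label-based node selection. It is a simplified model, not our actual configuration or the real Kubernetes scheduler: the label key and value (`role: frontend-certs`) and the node names are hypothetical, chosen only for the example.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(node_selector: Labels, node_labels: Labels) -> bool:
    """A node is eligible only if it carries every key/value pair in the selector."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def schedulable_nodes(node_selector: Labels, nodes: Dict[str, Labels]) -> List[str]:
    """Return the names of the nodes a pod with this selector could run on."""
    return sorted(name for name, labels in nodes.items() if matches(node_selector, labels))


# Hypothetical selector: the real label is not given in this post.
FRONTEND_SELECTOR = {"role": "frontend-certs"}

# Before the incident: one labelled node, one unlabelled node.
before = {"node-a": {"role": "frontend-certs"}, "node-b": {}}

# After node-a was removed and node-c provisioned without the label.
after_removal = {"node-b": {}, "node-c": {}}

# After the label was applied to the new node.
after_fix = {"node-b": {}, "node-c": {"role": "frontend-certs"}}

for name, cluster in [("before", before), ("after removal", after_removal), ("after fix", after_fix)]:
    print(name, schedulable_nodes(FRONTEND_SELECTOR, cluster))
```

In this model the middle case returns an empty list: the cluster has healthy nodes, but none that the frontend pods are allowed to run on.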
This meant that our front-facing web server was down and unable to receive traffic from outside our cluster, either to serve our frontend or to respond to API requests.
By 7:58am UTC we had identified the issue and applied the required label to the new node. By 8:00am UTC, the frontend was running again.
To prevent this problem in the future, we will be changing the architecture of this part of our infrastructure to remove the requirement that our persistent storage lives on a single node. This will allow node upgrades to occur without frontend downtime.
We will also be putting in some better monitoring and alerting, so that we’re able to respond to incidents like these much more quickly, even if we’re asleep when they’re happening.
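As a rough sketch of the kind of alerting rule this could involve, the example below fires an alert once an external health check has failed several times in a row. It describes one common approach, not the system we will necessarily build: the `Probe` type, the one-minute probe interval, and the threshold of three failures are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Probe:
    minute: int    # minutes since the start of observation
    healthy: bool  # result of one external health check


def first_alert_minute(probes: Iterable[Probe], failures_needed: int = 3) -> Optional[int]:
    """Return the minute at which an alert would fire, or None if it never does.

    Requiring several consecutive failures avoids paging on a single blip.
    """
    consecutive = 0
    for probe in probes:
        consecutive = 0 if probe.healthy else consecutive + 1
        if consecutive >= failures_needed:
            return probe.minute
    return None


# Hypothetical data: the site is healthy for two minutes, then stays down.
outage = [Probe(0, True), Probe(1, True), Probe(2, False),
          Probe(3, False), Probe(4, False), Probe(5, False)]

# Hypothetical data: isolated failures that recover before the threshold.
blips = [Probe(0, True), Probe(1, False), Probe(2, True),
         Probe(3, False), Probe(4, False), Probe(5, True)]

print(first_alert_minute(outage))
print(first_alert_minute(blips))
```

With a rule like this, a sustained outage would trigger a page within minutes of starting, rather than waiting for someone to notice.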
* * *
Thank you all for your patience while we resolved this issue. This is well below the level of service we endeavour to provide, and so I’d like to apologise again for the downtime.
— Michael, CEO Stacker