On February 22, from approximately 19:00 to 22:40 UTC, some customers on our Basic build platform experienced delayed build processing. The root causes of this incident relate to the way our infrastructure management system recovers from unexpected compute infrastructure termination.
We expect compute infrastructure to be terminated unexpectedly from time to time, and we have recovery routines in place to replenish our fleet of build machines. During this incident, however, our system struggled to obtain reliable replacement machines: replacements were themselves terminated, sending us into a recovery loop and putting stress on other components of the system, such as our messaging system. In addition, the Basic build system automatically reschedules pipelines that were scheduled on affected infrastructure. This created a large backlog of pipelines to process without adequate build infrastructure to run them. The incident correlates with observed instability from our infrastructure provider.
We were alerted to the incident by our incident management system, and by customers who submitted tickets to our CS team to let us know something was wrong (thanks!). Our engineering team took steps to stabilize the system by requesting additional compute resources from our cloud provider, AWS, and by manually draining queues unrelated to build processing in order to prioritize stabilization actions.
After performing an internal post-mortem on this incident, our engineering team will complete corrective and preventative actions to reduce the likelihood of this type of incident recurring. In addition to adding richer alerts so we are notified more quickly, we will add guards to our infrastructure management system so that it recovers more predictably from large numbers of unexpected infrastructure terminations, including the “recovery loop” scenario where our replacement machines are also terminated.
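To illustrate the kind of guard described above, here is a minimal sketch of bounded replacement with exponential backoff. All names and numbers here are illustrative assumptions, not our actual code: the idea is simply to cap replacement attempts and back off between rounds, rather than re-requesting machines immediately every time a replacement is itself terminated.

```python
import time

# Illustrative guard against a runaway "recovery loop":
# cap replacement attempts and back off exponentially between rounds.
MAX_ATTEMPTS = 5
BASE_DELAY_S = 2.0

def replace_machines(needed, request_machines, delay=time.sleep):
    """Try to replace `needed` machines; give up (and escalate) after
    MAX_ATTEMPTS rounds instead of looping indefinitely."""
    acquired = 0
    for attempt in range(MAX_ATTEMPTS):
        # The provider may return fewer machines than requested,
        # or machines that are later terminated.
        batch = request_machines(needed - acquired)
        acquired += len(batch)
        if acquired >= needed:
            return True  # fleet replenished
        # Exponential backoff eases pressure on downstream systems
        # (e.g. the messaging system) while capacity is unstable.
        delay(BASE_DELAY_S * (2 ** attempt))
    return False  # escalate to on-call rather than spinning forever
```

The cap turns an unbounded loop into a bounded one with a clear failure signal, and the backoff keeps retry traffic from amplifying load on the rest of the system during provider instability.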
We regret this period of degraded performance, and we know our customers place a great deal of trust in us to test and ship their code. We remain focused on continuous improvement.