On September 5, 2017, from 3:20 PM EDT to 7:18 PM EDT, Codeship Basic builds were delayed in processing, and there was an elevated level of system-initiated build failures. Because of contention for servers in our elastic build infrastructure, Codeship users saw that their build pipelines did not start running immediately after they were created. In some cases, builds were marked as failed with the system error state.
The Codeship Basic elastic build infrastructure relies on compute resources provided by Amazon Web Services EC2. We maintain a dynamic fleet of instances that we automatically scale up and down based on system demand. That demand is driven by the number of concurrent build pipelines across our customer base, which naturally fluctuates over the course of a day. As is to be expected with cloud infrastructure, compute resources will fail, and software should be designed to handle those situations.
On September 5th, three EC2 instances that were actively running Codeship Basic builds became unavailable. Codeship’s software is intended to be resilient to this situation and to restart those builds on available infrastructure with free capacity. A defect in the software caused a new build to be started for each pipeline that had been running on one of the unavailable machines, instead of restarting only the unique builds that were affected by the failure. One affected instance was running many pipelines from a build that was configured with fifty concurrent pipelines. The impact of the defect was that we rescheduled the entire build, and all of its pipelines, once for each pipeline that was impacted by the failed instance. This code path was executed numerous times, resulting in a very large number of duplicate and extraneous builds, each requiring substantial resources for all of its pipelines.
Auto-scaling routines were triggered on the Codeship Basic elastic build infrastructure as a result of the workload; however, AWS struggled to meet our demand, and our centralized Redis instance exhausted its available connections. At the peak of the incident, over 13,000 build pipelines were waiting for available build machines on Codeship Basic. In normal circumstances, no pipelines are waiting, and our build infrastructure has reserve capacity.
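To give a rough sense of how the defect amplified the workload, the following Python sketch compares the number of pipelines scheduled when a full build is restarted once per failed pipeline with the number scheduled when each affected build is restarted only once. It is an illustration only, not Codeship's actual code: the `Pipeline` type, the function names, and the figure of 20 pipelines on the failed instance are all assumptions made for the example. Only the fifty-pipeline build size comes from the incident itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    """Hypothetical record of a pipeline that was running on a failed instance."""
    build_id: str
    pipeline_id: int


def pipelines_scheduled_per_failed_pipeline(failed, pipelines_per_build):
    """Defective behavior as described: one full build restart per failed pipeline."""
    return sum(pipelines_per_build[p.build_id] for p in failed)


def pipelines_scheduled_per_unique_build(failed, pipelines_per_build):
    """Deduplicated behavior: one full build restart per unique affected build."""
    affected_builds = {p.build_id for p in failed}
    return sum(pipelines_per_build[build_id] for build_id in affected_builds)


# Assumption for illustration: 20 of the build's 50 pipelines ran on the failed instance.
failed = [Pipeline("build-a", i) for i in range(20)]
pipelines_per_build = {"build-a": 50}

defective = pipelines_scheduled_per_failed_pipeline(failed, pipelines_per_build)
deduplicated = pipelines_scheduled_per_unique_build(failed, pipelines_per_build)
print(defective, deduplicated)
```

Under these assumed numbers, the defective path schedules 20 × 50 pipelines, while the deduplicated path schedules 50. Repeating the defective path, as happened during the incident, multiplies the load further.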
What did we do to stabilize the system?
To stabilize the system during the incident, we scaled up the workers responsible for processing the queued backlog of events generated by all running builds. We closely monitored our systems as they scaled up to handle the large influx of builds, and we ensured that build infrastructure was available to process the backlog. In parallel, we identified the builds that had been errantly started due to the defect, some of which were running and some of which were queued, and terminated them so that normal builds could use the available capacity. When we exhausted Redis connections, we scaled the workers back to within the connection limit while still successfully processing the queues.
What will we do to prevent future occurrences?
To prevent an incident of this type from happening again, we have fixed the defect in our restart logic and deployed the fix to Codeship Basic. If a pipeline fails due to the termination of an underlying compute resource, our system is now aware of the build ID and restarts the pipeline in the context of the build that it belongs to, instead of triggering a new build for every failed pipeline. We are also increasing the size of our shared Redis instance to ensure that we do not exceed our connection limit during times of unusually high load. Additionally, we have scheduled work to diversify the types of compute resources that the Codeship Basic platform can use for build processing in order to minimize the risk of instance terminations.
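The corrected restart behavior can be sketched roughly as follows in Python. This is a minimal illustration under stated assumptions, not the deployed fix: `plan_restarts` and the `(build_id, pipeline_id)` tuple format are hypothetical. The idea is to group failed pipelines by the build they belong to, so that each pipeline is restarted once within its existing build and no new builds are created.

```python
from collections import defaultdict


def plan_restarts(failed_pipelines):
    """Group failed (build_id, pipeline_id) pairs by build ID.

    Hypothetical helper: returns a mapping from each existing build ID to the
    sorted pipeline IDs to restart within that build. Duplicate failure
    reports for the same pipeline collapse into a single restart.
    """
    plan = defaultdict(set)
    for build_id, pipeline_id in failed_pipelines:
        plan[build_id].add(pipeline_id)
    return {build_id: sorted(ids) for build_id, ids in plan.items()}


# Hypothetical failure reports, including one duplicate for ("build-a", 3).
failed_pipelines = [("build-a", 3), ("build-a", 7), ("build-b", 1), ("build-a", 3)]
plan = plan_restarts(failed_pipelines)
print(plan)
```

Keying the plan on the build ID means the amount of restarted work is bounded by the number of pipelines that actually failed, regardless of how many pipelines each build is configured to run.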