On July 26, from approximately 13:45 to 23:45 UTC, Codeship Pro builds did not process as expected. This outage coincided with an outage of Quay.io, a Docker image registry that Codeship Pro depends on.
The Codeship team deeply regrets the inconvenience and disruption this outage caused.
Our build machine management service, which interacts with Quay.io, was unable to bring new Codeship Pro build machines online, leading to increased build allocation times because our fleet of build machines was unable to scale.
During a Codeship Pro build, your build containers are controlled by a Docker container running Codeship Pro’s build supervisor. The Docker image for this supervisor container is stored and distributed via Quay.io. When the build machine management service brings new Codeship Pro build machines online, part of the provisioning process is to download the Docker image. Without that image, Codeship Pro builds can’t be run, as the supervisor container is a critical component. For this reason, if the image pull fails during provisioning, a provisioning failure occurs and the machine is not added to the pool of eligible machines. Some build machines did provision successfully during this period, and builds were allocated to those machines.
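The provisioning gate described above can be sketched roughly as follows. This is a hypothetical illustration, not Codeship's actual code: the function names (`provision_machine`, `pull_image`), the image name, and the pool structure are all assumptions.

```python
# Hypothetical sketch of the provisioning gate: a machine joins the pool of
# eligible build machines only if the supervisor image pull succeeds.
# All names here are illustrative, not Codeship's implementation.

def provision_machine(machine_id, pull_image, pool):
    """Attempt to provision a build machine; add it to the pool only on success."""
    try:
        # Illustrative image reference; the real supervisor image path is not public.
        pull_image("quay.io/example/build-supervisor:latest")
    except RuntimeError:
        # Pull failed (e.g. registry outage): this is a provisioning failure,
        # and the machine is never added to the pool of eligible machines.
        return False
    pool.append(machine_id)
    return True
```

During a registry outage, some pulls may still succeed intermittently, which matches the observation that some machines did provision successfully during this period.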
However, some builds that were successfully allocated to a build machine were also impacted by this incident.
When a Pro build starts, the machine downloads an updated version of the supervisor image from Quay.io to ensure the latest version of the image is used and all build components are in sync. If the pull fails, the build errors and fails. This behavior is intended to protect the build from service interface mismatches and other issues that could follow a recent deployment of one of Codeship Pro’s build services, such as the build supervisor receiving a message it is not equipped to handle because it is still running a previous version of the Docker image.
Because Codeship Pro build components are deployed via Codeship Pro, we were limited in our ability to update our system components via our normal CI/CD pipeline, though we do have alternative processes in place for situations like this.
From 14:15 to 19:33, pulling from Quay.io was working as expected, though the Codeship engineering and customer success teams chose to keep our own incident status at “Monitoring” since Quay.io had not resolved their incident. According to our metrics and dashboards, builds were processing as normal, but we didn’t want to falsely claim “Operational” status since we didn’t conclusively know the status of Quay.io.
At 19:33, Quay.io moved their status to “Major Outage”. This caused Codeship Pro to move to “Major Outage” as well. During this time, some builds were still able to run as expected. Codeship does not rely on the full suite of functions that Quay.io offers, such as image building, and registry pulls appeared to be at least partially operational at this time.
Our team focused on stabilization during this incident: removing Quay.io as a dependency and introducing different error handling and fallback methods would have taken at least a medium effort, compounded by the need for manual deployment. To stabilize our systems during this period, we took manual steps to increase the availability of machines in the Codeship Pro pool.
Currently, all errors from Quay.io are retried, though we have historically seen intermittent build failures due to these errors. In the case of a major outage of an upstream dependency like this one, our strategy of retries was not effective.
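The limitation described above can be illustrated with a minimal retry sketch. The attempt count, backoff delay, and function names are assumptions for illustration, not Codeship's actual implementation.

```python
import time

# Hypothetical sketch of retrying registry errors. Retries absorb intermittent
# failures, but during a sustained outage every attempt fails and the pull
# still errors out. Parameters are illustrative assumptions.

def pull_with_retries(pull_image, image, attempts=3, delay=1.0):
    """Retry a registry pull a fixed number of times before giving up."""
    last_error = None
    for attempt in range(attempts):
        try:
            return pull_image(image)
        except RuntimeError as err:
            last_error = err
            time.sleep(delay * (2 ** attempt))  # exponential backoff between attempts
    # All attempts failed: surface the last error to the caller.
    raise last_error
```

Against an intermittent error, one of the retries eventually succeeds; against a major outage of the upstream registry, all attempts fail and the strategy is ineffective, as we saw during this incident.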
To address the aforementioned intermittent failures, our engineering team already had work scheduled to move the supervisor image from Quay.io to Amazon’s ECR, as well as add redundancy in the image storage and distribution process. These initiatives will be completed with urgency as a result of this incident.
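Redundancy in image storage and distribution could take roughly the following shape: try the primary registry first, then fall back to a mirror. This is a sketch under stated assumptions; the registry names and function signature are illustrative, not a description of Codeship's planned implementation.

```python
# Hypothetical sketch of registry redundancy: attempt each registry in order
# until a pull succeeds. Registry hostnames and the image path are
# illustrative assumptions.

def pull_with_fallback(pull_image, image_path,
                       registries=("quay.io", "mirror.example.com")):
    """Try each registry in order until one pull succeeds."""
    last_error = None
    for registry in registries:
        try:
            return pull_image(f"{registry}/{image_path}")
        except RuntimeError as err:
            last_error = err  # this registry is down; try the next one
    # Every registry failed: surface the last error.
    raise last_error
```

With a mirror in place, an outage of any single registry no longer blocks provisioning or build starts, provided the mirrors are kept in sync.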
Planned Corrective Actions
- Move the build supervisor image from Quay.io to Amazon’s ECR.
- Add redundancy to the image storage and distribution process so that no single registry is a point of failure.
Again, we at Codeship take our responsibility to test and deploy your software very seriously. We apologize for this service disruption and understand the impact it had on your operations.