Codeship Basic builds delayed

Incident Report for CodeShip

Postmortem

On February 22, from about 19:00 to 22:40 UTC, some customers on our Basic build platform experienced delays in build processing. The root causes of this incident relate to the way our infrastructure management system recovers from the unexpected termination of compute infrastructure.

We expect compute infrastructure to be terminated unexpectedly from time to time, and we have recovery routines in place to replenish our fleet of build machines. However, during this incident, our system struggled to obtain reliable replacement machines, which sent it into a recovery loop and put stress on other components of the system, such as our messaging system. In addition, the Basic build system automatically reschedules pipelines that were scheduled on affected infrastructure. This created a large queue of pipelines to process without adequate build infrastructure to run them. The incident correlates with instability observed at our infrastructure provider.

We were alerted to the incident by our incident management system and by customers who submitted tickets to our CS team to let us know something was wrong (thanks!). Our engineering team took steps to stabilize the system by requesting additional compute resources from our cloud provider, AWS, and by manually draining queues unrelated to build processing in order to prioritize stabilization actions.

After performing an internal post-mortem about this incident, our engineering team will complete corrective and preventative actions to ensure this type of incident does not happen again. In addition to adding richer alerts so that we are notified more quickly, we will add guards to our infrastructure management system so that it recovers more predictably from large numbers of unexpected infrastructure terminations, including the “recovery loop” scenario in which our replacement machines are also terminated.
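
To illustrate the general idea of a guard against the “recovery loop” scenario, here is a minimal sketch that backs off exponentially when replacement machines keep getting terminated. It is not our actual implementation: the class name `ReplacementGuard`, its methods, and every threshold and delay value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReplacementGuard:
    """Hypothetical guard that throttles requests for replacement build machines."""

    base_delay_s: float = 30.0    # assumed initial wait between replacement attempts
    max_delay_s: float = 900.0    # assumed upper bound on the wait
    failure_threshold: int = 3    # assumed number of failed replacements before alerting
    consecutive_failures: int = 0

    def record_replacement(self, survived: bool) -> None:
        """Record whether the most recent replacement machine stayed up."""
        if survived:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    def next_delay_s(self) -> float:
        """Double the wait for each consecutive failed replacement, up to the cap."""
        delay = self.base_delay_s * (2 ** self.consecutive_failures)
        return min(delay, self.max_delay_s)

    def should_alert(self) -> bool:
        """Signal that replacements keep failing and a human should take a look."""
        return self.consecutive_failures >= self.failure_threshold
```

With a guard like this, repeated failures slow down further replacement requests instead of adding more load to the rest of the system, and a single healthy replacement resets the wait to its base value.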

We regret this period of degraded performance, and know that our customers put a lot of trust in us to test and ship their code. We’re continuously focused on improvements.

Posted Feb 28, 2018 - 17:09 UTC

Resolved

This incident has been resolved.

Posted Feb 22, 2018 - 22:40 UTC

Monitoring

Codeship Basic builds should be stabilizing now and we are continuing to monitor.

Posted Feb 22, 2018 - 21:45 UTC

Identified

We have identified the issue and are working on corrective actions now.

Posted Feb 22, 2018 - 19:42 UTC

Investigating

We are seeing delays with Codeship Basic builds starting and are investigating further.

Posted Feb 22, 2018 - 18:57 UTC