[post-mortem] 2026-04-24 - Cloud DB Upgrade Failure and Service Disruption

[post-mortem] 2026-04-24 - Cloud DB Upgrade Failure and Service Disruption

Incident Overview

Date: 2026-04-24
Time (PDT):

  • Downtime: 5:47 - 5:52 AM

    • Duration: 5 minutes
      Severity: High

  • Degradation: 5:52 - 6:15 AM

    • Duration: 5 minutes
      Severity: Medium

Date: 2026-04-24 - 2026-04-27

  • Degradation:
    Severity: Medium

Services Impacted:

  • Cloud Portal

  • Email Notifications

  • Push Notifications

  • Cross-Site Layouts

  • Cloud Access (Desktop/Mobile Clients)

Customer Impact:

  • Desktop / Mobile Clients being unexpectedly logged out

  • Some customers cannot connect to their Sites

  • Certain API requests are experiencing high latency and increased failure rates

Root Cause

During a scheduled maintenance window on April 24, traffic was routed to an upgraded version of the Cloud DB service. The deployment process first switched traffic for a test system to the upgraded Cloud DB version. Testing against the test system was successful, so the deployment proceeded to the next phase, which was to switch traffic for all systems.

After traffic from all systems was routed to the upgraded Cloud DB version, the service became unstable under full production load. This caused a complete disruption of Cloud DB API operations from 5:47 AM to 5:52 AM PDT, followed by a degraded service period from 5:52 AM to 6:15 AM PDT.

The upgraded Cloud DB version contained a software defect that did not appear during the initial test-system validation. The issue only became visible after the service began handling full production traffic. Under that load, the upgraded service failed and required rollback to the previous stable version.

A secondary contributing factor was an issue with the new Cloud DB routing switch and failover tool. This tool had been built, as part of the corrective actions, after the previous unsuccessful deployment to reduce rollback time and allow faster traffic cutover between active Cloud DB versions. The tool had been successfully tested in lower environments. However, during the production deployment, the tool failed when attempting to switch traffic for all systems.

Because the deployment was already in progress, the team proceeded using CLI-based commands to complete the traffic switch. The CLI approach did not support the same logic as the routing switch tool. Instead of replacing AWS target groups behind the load balancers, the CLI commands replaced the running service instances registered inside the same target group.

Once the traffic switch was completed, the team confirmed that only the upgraded Cloud DB service instance was registered in the target group. Within a few minutes, issues were detected with the upgraded Cloud DB version, and the team decided to roll back.

Rollback was performed using the same CLI-based approach by replacing the upgraded service instance with the previous stable service instance inside the same AWS target group. After rollback, service was restored, and the team confirmed that only the desired stable instance was registered in the target group. After the rollback was completed, the team verified that only the desired stable Cloud DB instance was registered in the target group. At that time, service appeared to be restored, and the maintenance call was concluded.

However, at 6:59 AM PDT, an hour after the rollback and 20 minutes after the maintenance call ended, the second Cloud DB instance was unexpectedly re-registered into the target group. This caused a portion of production traffic to be routed to the unintended Cloud DB instance, resulting in intermittent performance issues and failed API calls over the weekend.

The re-registration occurred because the Cloud DB ECS service still had the affected target groups configured in its LoadBalancer configuration. After the manual CLI-based rollback updated the target group membership directly, ECS later (an hour later) reconciled the service state against its configured load balancer settings and re-registered the Cloud DB instance back into the target group.

This issue was not immediately detected over the weekend because traffic volume was low and the impact was less noticeable. On Monday, April 27, as production traffic increased, the team observed the degradation and began investigating. The team identified that the unintended Cloud DB instance had been re-registered into the target group, corrected the target group/service configuration, and removed the unintended traffic path.

The issue was fully resolved at approximately 11:00 AM PDT on Monday, April 27.

How We Fixed It

Production traffic was rolled back from the upgraded Cloud DB version to the previous stable version using CLI-based commands.

After rollback, the stable Cloud DB instance was verified as the only registered target, and service functionality recovered.

On Monday, April 27, after renewed degradation was observed, an unintended Cloud DB instance was identified and removed from the target group. The ECS service and target group configuration were corrected to ensure production traffic routed only to the stable Cloud DB version.

Corrective Actions

Short Term

  • Fix the identified software defect and deploy the corrected version

  • Fix and improve the Cloud DB routing switch and failover tool to support reliable traffic cutover and rollback between active Cloud DB versions. The tool should be updated to use a clearly defined configuration file, that includes the required environment-specific values, including service versions, load balancers, listeners, target groups, and ECS service details.

    • Add validation before and after the switch to confirm:

      • the correct target groups are used

      • ECS service configuration matches the intended active version

      • only the expected Cloud DB instance is registered as healthy

      • production traffic is routed only to the intended Cloud DB version

  • Update the deployment and rollback runbook to make the routing switch tool the primary method and CLI-based changes an emergency fallback only.