[post-mortem] 2026-04-08 - Cloud DB Upgrade Failure and Service Disruption

[post-mortem] 2026-04-08 - Cloud DB Upgrade Failure and Service Disruption

Incident Overview

Date: 2026-04-08
Time (PDT): 5:11 PM - 5:39 PM (00:11 - 00:39 UTC)
Duration: 28 minutes
Severity: High

Duration breakdown:

  • Degradation: ~20 minutes

  • Service Disruption: ~8 minutes

Services Impacted:

  • Cloud Portal

  • Email Notifications

  • Push Notifications

  • Cross-Site Layouts

  • Cloud Access (Desktop/Mobile Clients)

Customer Impact:

  • Desktop / Mobile Clients being unexpectedly logged out

  • Some customers cannot connect to their Sites

  • Certain API requests are experiencing high latency and increased failure rates

Root Cause

During a scheduled maintenance window on the evening of April 8, an upgrade of the Cloud DB service to a new version resulted in service instability. When traffic was routed to the upgraded service, it became unavailable, causing approximately 8 minutes of complete service disruption for Cloud DB API operations. The issue was resolved by reverting traffic to the previous stable version, which had remained running throughout the maintenance.

A software defect in the new version caused the service to crash when handling production traffic. The defect was in a new congestion control feature that did not properly handle a specific connection state, resulting in a crash under load. The issue did not manifest during the warmup period because the code path is only triggered under active traffic. The root cause has been identified and a fix is in progress.

How We Fixed It

Traffic was reverted to the previous stable version, which had been running continuously and required no recovery time. Service was fully restored within minutes of the revert.

Corrective Actions

Short Term

  • Fix the identified software defect and deploy the corrected version

  • Build and integrate a Cloud DB routing switch and failover tool to enable rapid traffic cutover and rollback between active service versions. The goal is to reduce rollback time from several minutes to seconds and minimize customer impact during failed Cloud DB deployments