[post-mortem] 2026-04-08 - Cloud DB Upgrade Failure and Service Disruption
Incident Overview
Date: 2026-04-08
Time (PDT): 5:11 PM - 5:39 PM (00:11 - 00:39 UTC on 2026-04-09)
Duration: 28 minutes
Severity: High
Duration breakdown:
Degradation: ~20 minutes
Service Disruption: ~8 minutes
Services Impacted:
Cloud Portal
Email Notifications
Push Notifications
Cross-Site Layouts
Cloud Access (Desktop/Mobile Clients)
Customer Impact:
Desktop and mobile clients were unexpectedly logged out
Some customers could not connect to their Sites
Certain API requests experienced high latency and increased failure rates
Root Cause
During a scheduled maintenance window on the evening of April 8, an upgrade of the Cloud DB service to a new version resulted in service instability. When traffic was routed to the upgraded service, it became unavailable, causing approximately 8 minutes of complete service disruption for Cloud DB API operations. The issue was resolved by reverting traffic to the previous stable version, which had remained running throughout the maintenance.
A software defect in the new version caused the service to crash when handling production traffic. The defect was in a new congestion control feature that did not properly handle a specific connection state, resulting in a crash under load. The issue did not manifest during the warmup period because the code path is only triggered under active traffic. The root cause has been identified and a fix is in progress.
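The class of defect described above, a code path that assumes every connection state is handled, can be illustrated with a generic sketch. This is not the actual Cloud DB code: the state names, the function names, and the fallback window value are all hypothetical and chosen only to show how an unhandled state can turn into a crash, and how a conservative default avoids it.

```python
from enum import Enum, auto


class ConnState(Enum):
    # Hypothetical states; the real service's states are not listed in this report.
    IDLE = auto()
    ACTIVE = auto()
    DRAINING = auto()


def window_unsafe(state: ConnState, base: int) -> int:
    """Look up a congestion window, assuming every state is in the table."""
    table = {ConnState.IDLE: 0, ConnState.ACTIVE: base}
    # Raises KeyError for any state missing from the table (here: DRAINING).
    return table[state]


def window_safe(state: ConnState, base: int) -> int:
    """Same lookup, but unhandled states fall back to a minimal window."""
    table = {ConnState.IDLE: 0, ConnState.ACTIVE: base}
    # Hypothetical conservative fallback of 1 instead of crashing.
    return table.get(state, 1)
```

A state that only appears once real connections are being served would never reach the unsafe lookup during an idle warmup, which is consistent with the failure surfacing only after traffic was routed.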
How We Fixed It
Traffic was reverted to the previous stable version, which had been running continuously and required no recovery time. Service was fully restored within minutes of the revert.
Corrective Actions
Short Term
Fix the identified software defect and deploy the corrected version.
Build and integrate a Cloud DB routing switch and failover tool to enable rapid traffic cutover and rollback between active service versions. The goal is to reduce rollback time from several minutes to seconds and to minimize customer impact during failed Cloud DB deployments.
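As a rough illustration of the routing switch idea, the sketch below keeps the previous version on record so that a rollback is a single pointer swap rather than a redeploy. It is a minimal in-memory model under stated assumptions, not the planned tool: RoutingSwitch, cutover_with_guard, the is_healthy callback, and the number of health checks are all hypothetical names and parameters.

```python
import threading
from typing import Callable, Optional


class RoutingSwitch:
    """Hypothetical switch tracking which service version receives traffic."""

    def __init__(self, active: str) -> None:
        self._lock = threading.Lock()
        self._active = active
        self._previous: Optional[str] = None

    @property
    def active(self) -> str:
        with self._lock:
            return self._active

    def cutover(self, target: str) -> None:
        """Route traffic to target and remember the version it replaced."""
        with self._lock:
            self._previous, self._active = self._active, target

    def rollback(self) -> bool:
        """Return traffic to the remembered version; False if there is none."""
        with self._lock:
            if self._previous is None:
                return False
            self._active, self._previous = self._previous, None
            return True


def cutover_with_guard(
    switch: RoutingSwitch,
    target: str,
    is_healthy: Callable[[str], bool],
    checks: int = 3,
) -> bool:
    """Cut over, then roll back immediately if any health check fails."""
    switch.cutover(target)
    for _ in range(checks):
        if not is_healthy(target):
            switch.rollback()
            return False
    return True
```

In this model the previous version must stay running during the cutover, as it did in this incident, so that the swap back requires no recovery time.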