[post-mortem] 2026-03-30 - Cloud DB Service Disruption
Incident Overview
Date: 2026-03-30
Time (PDT): 12:08 PM - 1:30 PM (19:08 - 20:30 UTC)
Duration: 72 minutes (corrected)
Severity: High
Duration breakdown:
Complete outage: ~2 minutes
Significant degradation: ~20 minutes
Intermittent errors: ~50 minutes
Services Impacted:
Cloud DB (primary)
Cloud Portal
Cloud Connectivity (Connection Mediator, VMS Gateway)
Customer Impact:
Cloud-connected systems experienced a brief complete outage (~2 minutes) followed by degraded service (~20 minutes) and intermittent errors (~50 minutes) during peak US business hours.
Users attempting to access Cloud Portal, authenticate, or connect to systems through the cloud experienced failures during the outage and degradation periods.
Already-established direct connections between clients and servers were not affected.
The majority of service was restored within approximately 20 minutes, with full recovery at approximately 1:30 PM PDT (20:30 UTC).
Root Cause
The Cloud DB service experienced an unexpected application crash caused by a software defect in its network I/O handling. The service automatically restarted within seconds, but required additional time to fully reload its operational data before all requests could be served successfully.
This crash is part of a recurring pattern that the backend team is addressing.
How We Fixed It
The service recovered automatically - the container orchestration platform detected the crash and launched a replacement within 26 seconds. No manual intervention was required to restore service.
The extended recovery time stems from the service's architecture: it must load a large dataset into memory on startup before it can serve all request types.
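To illustrate why a restarted process can be running yet still return errors, here is a minimal Python sketch of a warm-up gate. It is not the Cloud DB implementation: the class name WarmupGate, its methods, and the status codes are hypothetical and chosen only for illustration. The idea is that requests depending on the in-memory dataset fail fast until the startup load completes.

```python
import threading


class WarmupGate:
    """Hypothetical helper: tracks whether the startup dataset is loaded."""

    def __init__(self):
        self._ready = threading.Event()
        self._data = {}

    def load(self, records):
        # Stand-in for the startup load; a real service would read from storage.
        for key, value in records:
            self._data[key] = value
        # Only after the full dataset is in memory is the service marked ready.
        self._ready.set()

    def is_ready(self):
        # A readiness check could expose this value to the orchestrator.
        return self._ready.is_set()

    def handle(self, key):
        # Requests that need the dataset are rejected until the load finishes.
        if not self.is_ready():
            return (503, "warming up")
        if key not in self._data:
            return (404, "not found")
        return (200, self._data[key])


gate = WarmupGate()
before = gate.handle("tenant-1")   # process is up, data not yet loaded
gate.load([("tenant-1", "config-a")])
after = gate.handle("tenant-1")    # data loaded, request succeeds
```

In this sketch, the window between process start and the end of load() corresponds to the degraded period described above: the process is alive, but lookups cannot be answered yet.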
Corrective Actions
Already Implemented
Collected comprehensive diagnostic data from the crash for the development team to analyze.
Short Term
Recover and analyze existing crash diagnostic data to identify the specific software defect.
Long Term
The development team is actively rewriting the core of this service to improve resilience and reduce recovery time.