[post-mortem] 2026-03-30 - Cloud DB Service Disruption

Incident Overview

Date: 2026-03-30
Time (PDT): 12:08 PM - 1:30 PM (19:08 - 20:30 UTC)
Duration: 72 minutes (corrected)
Severity: High

Duration breakdown:

  • Complete outage: ~2 minutes

  • Significant degradation: ~20 minutes

  • Intermittent errors: ~50 minutes

Services Impacted:

  • Cloud DB (primary)

  • Cloud Portal

  • Cloud Connectivity (Connection Mediator, VMS Gateway)

Customer Impact:

  • Cloud-connected systems experienced a brief complete outage (~2 minutes) followed by degraded service (~20 minutes) and intermittent errors (~50 minutes) during peak US business hours.

  • Users attempting to access Cloud Portal, authenticate, or connect to systems through the cloud experienced failures during the outage and degradation periods.

  • Already-established direct connections between clients and servers were not affected.

  • The majority of service was restored within approximately 20 minutes, with full recovery at approximately 1:30 PM PDT (20:30 UTC).

Root Cause

The Cloud DB service experienced an unexpected application crash caused by a software defect in its network I/O handling. The service automatically restarted within seconds, but required additional time to fully reload its operational data before all requests could be served successfully.

This crash is part of a recurring pattern that the backend team is addressing.

How We Fixed It

The service recovered automatically: the container orchestration platform detected the crash and launched a replacement within 26 seconds. No manual intervention was required to restore service.

The extended recovery time is due to the service's architecture, which requires loading a large dataset into memory on startup before it can serve all request types.
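
To illustrate why a restarted process can be running yet unable to serve every request, here is a minimal Python sketch of a warm-up gate. It is not the Cloud DB implementation; the names WarmupGate, slow_loader, and the status codes returned are all hypothetical and chosen only for illustration. The sketch assumes the dataset is loaded in a background thread and that data-dependent requests are rejected until loading completes.

```python
import threading
import time


class WarmupGate:
    """Tracks whether the in-memory dataset has finished loading (hypothetical)."""

    def __init__(self, loader):
        self._loader = loader          # assumed callable that returns the dataset
        self._ready = threading.Event()
        self._data = None

    def start(self):
        # Load in the background so the process can answer liveness checks at once.
        thread = threading.Thread(target=self._load, daemon=True)
        thread.start()
        return thread

    def _load(self):
        self._data = self._loader()
        self._ready.set()

    def liveness(self):
        # The process is up; this says nothing about whether data is loaded.
        return 200

    def readiness(self):
        # 503 until the dataset is loaded, so traffic can be held back.
        return 200 if self._ready.is_set() else 503

    def lookup(self, key):
        # Data-dependent requests fail fast with 503 while warming up.
        if not self._ready.is_set():
            return 503, None
        return 200, self._data.get(key)


def slow_loader():
    time.sleep(0.2)                    # stand-in for the real dataset load
    return {"system-1": "online"}


gate = WarmupGate(slow_loader)
before = (gate.liveness(), gate.readiness(), gate.lookup("system-1"))
worker = gate.start()
worker.join()                          # wait for the warm-up to finish
after = (gate.readiness(), gate.lookup("system-1"))
```

In this sketch the gap between the liveness and readiness results corresponds to the window in which the replacement was already running but some request types could not yet be served.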

Corrective Actions

Already Implemented

  • Collected comprehensive diagnostic data from the crash for the development team to analyze.

Short Term

  • Recover and analyze existing crash diagnostic data to identify the specific software defect.

Long Term

  • The development team is actively rewriting the core of this service to improve resilience and reduce recovery time.