[post-mortem] 2025-09-16 - 2025-10-06 Cloud Connectivity Service Degradation

Incident Overview

Date & Time (PST):

  • September 16, 2025 — 11:00 AM to 3:00 PM (4 hours)

  • September 17, 2025 — 11:00 AM to 3:00 PM (4 hours)

  • September 18, 2025 — 11:00 AM to 3:00 PM (4 hours)

  • September 23, 2025 — 11:00 AM to 3:00 PM (4 hours)

  • September 24, 2025 — 11:00 AM to 3:00 PM (4 hours)

  • September 25, 2025 — 11:00 AM to 3:00 PM (4 hours)

  • September 30, 2025 — 9:35 AM to 4:00 PM (6h 25m)

  • October 1, 2025 — 11:00 AM to 3:00 PM (4 hours)

  • October 2, 2025 — 11:00 AM to 3:00 PM (4 hours)

  • October 4, 2025 — 6:00 PM to 7:06 PM (1h 06m)

  • October 6, 2025 — 6:30 PM to 6:40 PM (10m)

Severity: High

Services Impacted:

  • Cloud Portal

  • Email Notifications

  • Push Notifications

  • Cross-Site Layouts

  • Cloud Access (Desktop/Mobile Clients)

Customer Impact:

  • Desktop and mobile clients were unexpectedly logged out

  • Some customers could not connect to their Sites

  • Certain API requests experienced high latency and increased failure rates

Incidents & emergency updates:

Root Cause

  • Under peak traffic, multiple Cloud DB threads accessed the shared cache at the same time and contended for its lock. The resulting lock waits and thread stalls increased latency.

  • When the server size was increased, the additional CPUs and threads competed for the same cache. Latency rose because the new threads were blocked as well.

  • Connections between Cloud DB and Channel Partners timed out under load. The timeouts triggered retries, which increased overall load.
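
To illustrate how retries add load, the sketch below estimates the average number of calls generated per request. It is a simplified model, not a measurement from this incident: it assumes every call fails independently with the same probability, and the names expected_attempts, failure_prob and max_retries are hypothetical.

```python
def expected_attempts(failure_prob: float, max_retries: int) -> float:
    """Average number of calls made per request when failures are retried.

    Assumption: each call fails independently with probability failure_prob.
    Attempt k (counting from 0) only happens if the previous k attempts all
    failed, so the expectation is the sum of failure_prob**k for
    k = 0 .. max_retries.
    """
    if not 0.0 <= failure_prob <= 1.0:
        raise ValueError("failure_prob must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must not be negative")
    return sum(failure_prob ** k for k in range(max_retries + 1))
```

For example, with an assumed 50% timeout rate and two retries, expected_attempts(0.5, 2) returns 1.75, meaning each request produces 1.75 calls on average. Under real load, failures tend to be correlated, so the actual amplification can be higher.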

How We Fixed It

  • Split the cache to reduce contention, so that threads are less likely to compete for the same lock.

  • After the cache was split, we increased the number of threads to improve throughput.

  • Deployed a connection pool and in-memory reuse for Channel Partners calls.

  • Increased targeted timeouts to avoid cascading failures.
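
The cache splitting mentioned above can be sketched as follows. This is a minimal illustration, assuming a dictionary-backed cache in which each key maps to one of several shards and every shard has its own lock. The class name ShardedCache and the default shard count are hypothetical and do not describe the actual Cloud DB implementation.

```python
import threading
from typing import Any, Hashable


class ShardedCache:
    """Cache split into shards, each guarded by its own lock (illustrative)."""

    def __init__(self, num_shards: int = 16) -> None:
        if num_shards < 1:
            raise ValueError("num_shards must be at least 1")
        # One lock per shard instead of a single lock for the whole cache.
        self._locks = [threading.Lock() for _ in range(num_shards)]
        self._shards = [{} for _ in range(num_shards)]

    def _index(self, key: Hashable) -> int:
        # Within one process, the same key always maps to the same shard.
        return hash(key) % len(self._shards)

    def get(self, key: Hashable, default: Any = None) -> Any:
        i = self._index(key)
        with self._locks[i]:
            return self._shards[i].get(key, default)

    def put(self, key: Hashable, value: Any) -> None:
        i = self._index(key)
        with self._locks[i]:
            self._shards[i][key] = value
```

Two threads block each other only when their keys fall into the same shard, so raising the thread count adds less lock waiting than it would with a single shared lock.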

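The connection pool for Channel Partners calls listed under "How We Fixed It" could look roughly like the sketch below. It is a simplified illustration, assuming a fixed-size pool of connections that are created once and reused. ConnectionPool, factory and acquire_timeout are hypothetical names, and the sketch omits health checks and replacement of broken connections.

```python
import queue
from contextlib import contextmanager
from typing import Any, Callable, Iterator


class ConnectionPool:
    """Fixed-size pool that hands out reusable connections (illustrative)."""

    def __init__(self, factory: Callable[[], Any], size: int = 4,
                 acquire_timeout: float = 5.0) -> None:
        # factory is a caller-supplied function that opens one connection.
        self._idle = queue.Queue(maxsize=size)
        self._acquire_timeout = acquire_timeout
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self) -> Iterator[Any]:
        # Waits for a free connection; raises queue.Empty if none becomes
        # available within acquire_timeout seconds.
        conn = self._idle.get(timeout=self._acquire_timeout)
        try:
            yield conn
        finally:
            # Return the connection for reuse instead of closing it.
            self._idle.put(conn)
```

A caller would wrap each outbound call in "with pool.connection() as conn:". Because the pool size is fixed, it also caps the number of concurrent connections opened toward the downstream service.
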
Corrective Actions

Already Implemented

  • Added monitoring of connection metrics between Cloud DB and Channel Partners.

  • Added tooling and scripts to monitor additional host metrics: CPU, load/run-queue, memory pressure, and disk I/O wait.

  • Added thresholds and alerts to detect resource contention.
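
A threshold check over the host metrics listed above might be sketched as follows. The metric names and limits are hypothetical values chosen for illustration and are not the thresholds used in production.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float


# Hypothetical limits for illustration only.
THRESHOLDS = (
    Threshold("cpu_percent", 85.0),
    Threshold("run_queue_per_core", 2.0),
    Threshold("memory_pressure_percent", 80.0),
    Threshold("io_wait_percent", 20.0),
)


def breached(sample: Dict[str, float],
             thresholds: Sequence[Threshold] = THRESHOLDS) -> List[str]:
    """Return the names of metrics in sample that exceed their limit.

    Metrics missing from the sample are treated as not breached.
    """
    return [t.metric for t in thresholds
            if t.metric in sample and sample[t.metric] > t.limit]
```

For a sample such as {"cpu_percent": 90.0, "run_queue_per_core": 1.0, "io_wait_percent": 35.0}, the function returns ["cpu_percent", "io_wait_percent"], which an alerting job could then forward to the on-call channel.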

Long Term

  • Redesign and improve overall Cloud DB architecture.

  • Continue optimizing Cloud DB services to better handle peak loads.