[post-mortem] 2025-09-16 - 2025-10-06 Cloud Connectivity Service Degradation
Incident Overview
Date & Time (PST):
September 16, 2025 — 11:00 AM to 3:00 PM (4 hours)
September 17, 2025 — 11:00 AM to 3:00 PM (4 hours)
September 18, 2025 — 11:00 AM to 3:00 PM (4 hours)
September 23, 2025 — 11:00 AM to 3:00 PM (4 hours)
September 24, 2025 — 11:00 AM to 3:00 PM (4 hours)
September 25, 2025 — 11:00 AM to 3:00 PM (4 hours)
September 30, 2025 — 9:35 AM to 4:00 PM (6h 25m)
October 1, 2025 — 11:00 AM to 3:00 PM (4 hours)
October 2, 2025 — 11:00 AM to 3:00 PM (4 hours)
October 4, 2025 — 6:00 PM to 7:06 PM (1h 06m)
October 6, 2025 — 6:30 PM to 6:40 PM (10m)
Severity: High
Services Impacted:
Cloud Portal
Email Notifications
Push Notifications
Cross-Site Layouts
Cloud Access (Desktop/Mobile Clients)
Customer Impact:
Desktop and mobile clients were unexpectedly logged out
Some customers could not connect to their Sites
Certain API requests experienced high latency and increased failure rates
Incidents & emergency updates:
[incident] 2025-09-16 - Ongoing Cloud Connectivity Service Degradation
Root Cause
Under peak traffic, multiple threads in Cloud DB accessed the shared cache at the same time. The resulting lock waits and thread stalls increased latency.
When the server size was increased, more CPU cores and threads competed for the same cache, so latency rose as the new threads were blocked.
Connections between Cloud DB and Channel Partners timed out under load. The timeouts triggered retries, which increased overall load further.
How We Fixed It
Introduced cache splitting to minimize contention and reduce the probability of threads competing for the same lock.
After the cache was split, we increased the thread count to improve throughput.
Deployed a connection pool and in-memory reuse for Channel Partners calls.
Increased targeted timeouts to avoid cascading failures.
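The cache splitting described above can be illustrated with a minimal sketch. It is not the Cloud DB implementation: the class name ShardedCache, the shard count, and the loader callback are all hypothetical, chosen only to show how giving each shard its own lock reduces the chance that two threads wait on the same lock.

```python
import threading
from typing import Any, Callable, Hashable


class ShardedCache:
    """Illustrative cache split into shards, each guarded by its own lock.

    Hypothetical sketch: names and the default shard count are assumptions,
    not taken from the Cloud DB code base.
    """

    def __init__(self, num_shards: int = 16) -> None:
        if num_shards < 1:
            raise ValueError("num_shards must be at least 1")
        # One lock and one dict per shard, so threads that touch different
        # shards never block each other.
        self._locks = [threading.Lock() for _ in range(num_shards)]
        self._shards = [{} for _ in range(num_shards)]

    def _index(self, key: Hashable) -> int:
        # Python's % always returns a non-negative result for a positive
        # modulus, so negative hashes are handled correctly.
        return hash(key) % len(self._shards)

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        """Return the cached value, calling loader(key) once on a miss."""
        i = self._index(key)
        with self._locks[i]:
            shard = self._shards[i]
            if key not in shard:
                shard[key] = loader(key)
            return shard[key]

    def invalidate(self, key: Hashable) -> None:
        """Drop a key from its shard if present."""
        i = self._index(key)
        with self._locks[i]:
            self._shards[i].pop(key, None)
```

In this sketch the loader runs while the shard lock is held, which keeps the example short but would still stall other keys in the same shard during a slow load. With a single shard the class behaves like the original one-lock cache, which is why adding threads before splitting made contention worse rather than better.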
Corrective Actions
Already Implemented
Added new monitoring for Cloud DB to Channel Partners connection metrics.
Added new tooling/scripts to monitor additional Host Metrics: CPU, load/run-queue, memory pressure, disk I/O wait.
Added thresholds and alerts to detect resource contention.
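As a rough sketch of what a threshold check over the host metrics listed above might look like: the metric names, limit values, and the breached function are hypothetical placeholders, not the thresholds or tooling actually deployed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # key expected in a metrics sample
    limit: float  # alert when the sampled value exceeds this


# Hypothetical limits for illustration only; real values would be tuned
# against observed baselines.
THRESHOLDS = (
    Threshold("cpu_percent", 85.0),
    Threshold("run_queue_per_core", 2.0),
    Threshold("memory_pressure_percent", 10.0),
    Threshold("io_wait_percent", 20.0),
)


def breached(sample, thresholds=THRESHOLDS):
    """Return one message per metric in the sample that exceeds its limit."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        # Metrics missing from the sample are skipped rather than alerted on.
        if value is not None and value > t.limit:
            alerts.append(f"{t.metric}={value} exceeds {t.limit}")
    return alerts
```

A sample such as {"cpu_percent": 92.0, "io_wait_percent": 25.0} would yield two alert messages, while a sample under every limit yields an empty list.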
Long Term
Redesign and improve overall Cloud DB architecture.
Continue optimizing Cloud DB services to better handle peak loads.