[incident] 2025-09-16 - Ongoing Cloud Connectivity Service Degradation (Resolved)
Incident Summary
Date: Sep 16, 2025
Time: TBA
Service Impacted: Connectivity to Sites
Severity: Medium
Total Duration: TBA
Customer Impact:
Desktop and mobile clients were unexpectedly logged out
Some customers could not connect to their Sites
Certain API requests experienced high latency and increased failure rates
Details
Sep 18, 2025
Yesterday evening we upgraded the database tier to increase throughput and networking headroom. However, early peak-traffic signals this morning indicate that the upgrade alone did not remove the bottleneck.
We have identified issues in the Cloud DB service layer that are contributing to latency and errors. Our engineering team is implementing fixes, and we plan to roll out an update this week. We will share the maintenance window as soon as it is set.
In the meantime, we are closely monitoring both RDS performance and Cloud DB service latency as traffic ramps up.
Sep 17, 2025
We’re seeing elevated latency and intermittent errors on Cloud DB. We recently upgraded the Cloud DB service to a new instance family with higher packets-per-second (PPS) capacity to address network bottlenecks. This change added some network headroom, but latency on the Cloud DB service remains elevated. During peak periods, customers may experience intermittent timeouts or retry prompts and generally slower API responses. Off-peak performance remains close to normal.
Internal analysis shows the database is hitting network performance ceilings that impact request latency and reliability. Based on our own monitoring and confirmation from AWS engineers, the current bottleneck appears to be at the Cloud DB’s RDS database layer, not the EC2 instances hosting the application. We have escalated the issue to AWS and are coordinating closely on recommended paths forward.
We are scheduling an emergency change to move the database to a different instance family with a significantly higher PPS threshold and throughput capacity. This will increase overall resilience under heavy connection surges and reduce the likelihood of packet loss or request timeouts between the Cloud DB service and the Cloud DB database.
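For integrations that call the API directly and hit the intermittent timeouts described above, one common mitigation is to retry with capped exponential backoff and jitter, so that retries spread out instead of adding to a connection surge. The sketch below is illustrative only and is not official client guidance: the function names, the default limits, and the choice of `TimeoutError` as the retryable error are all assumptions, and `request` stands in for whatever call your integration makes.

```python
import random
import time


def backoff_delays(max_attempts=5, base=0.5, cap=30.0, rng=random.random):
    """Yield one capped, fully jittered delay (in seconds) per retry.

    Hypothetical helper: the defaults are illustrative, not recommended values.
    """
    for attempt in range(max_attempts - 1):
        # Exponential growth, limited to `cap` seconds.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random point between 0 and the ceiling.
        yield rng() * ceiling


def call_with_retry(request, max_attempts=5, base=0.5, cap=30.0,
                    sleep=time.sleep, rng=random.random):
    """Call request() until it succeeds or max_attempts calls have failed.

    Assumes the caller's request function raises TimeoutError on a timeout.
    """
    delays = backoff_delays(max_attempts, base, cap, rng)
    while True:
        try:
            return request()
        except TimeoutError:
            delay = next(delays, None)
            if delay is None:
                # Out of retries: surface the last timeout to the caller.
                raise
            sleep(delay)
```

Only requests that are safe to repeat (idempotent reads, for example) should be wrapped this way; the jitter keeps many clients from retrying at the same instant.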