[incident] 2025-10-20 07:00 PST Cloud Connectivity Discovery Service Outage (Resolved)
Incident Summary
Date: Oct 20, 2025
Time: 2:20 AM PST - 12:40 PM PST
Service Impacted: Discovery Service
Severity: High
Total Duration: 10 hours 20 minutes
Customer Impact:
New systems will not be able to connect as endpoints are not discoverable.
Existing connections are not impacted.
Users may experience intermittent offline status or random disconnections.
An upstream AWS event, “Operational issue - Multiple services (N. Virginia)”, caused elevated error rates and DNS resolution issues across multiple services (incl. DynamoDB/SQS/EC2). This impacted our Discovery Service, which depends on these AWS services. Our Site Reliability Engineering team is working on the investigation and will provide updates here.
Incident Timeline (PST)
Oct 19, 11:54 PM
The first indication of the AWS outage was a WRT alert for the speedtest service at 11:54 PM PST, followed by a couple of push notification queue alerts 5 minutes later.
Oct 20, 12:54 AM
First portal alerts came in. Some portals were impacted from about 12:55 AM to 1:33 AM PST. Push and Notification services currently have a higher-than-normal WRT, but we have no queue alerts.
Oct 20, 02:20 AM
Discovery service impact begins.
Oct 20, 05:48 AM
AWS reported progress on resolving the issue with new EC2 instance launches in the US-EAST-1 Region and is now able to successfully launch new instances in some Availability Zones.
Oct 20, 07:11 AM
We tried to restart the underlying EC2 instance for Discovery Service. This attempt failed. AWS reports that there are still elevated errors for launching new EC2 instances.
Oct 20, 08:30 AM
AWS continues to investigate the root cause of the network connectivity issues. Discovery Service is still impacted.
Oct 20, 09:40 AM
We managed to launch Discovery Service. We’re monitoring closely and validating the endpoint.
Oct 20, 09:47 AM
Discovery service health checks restored. Confirming the status of other affected services.
Oct 20, 11:45 AM
All critical cloud services are looking healthy and responding normally. Email notifications are currently delayed due to timeouts and errors communicating with AWS services.
Oct 20, 12:40 PM
Email notifications have fully caught up. We were able to launch additional workers to drain the backlog, and the queue is clear. On our side, all alerts are closed, and live monitoring shows no issues. Discovery remains healthy for both new and existing connections. We’ll keep watching, but at this point everything looks normal.
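For reference, the durations implied by the timeline above can be checked with a short sketch. The timestamps are copied from the timeline entries; the helper name format_duration and the choice of which milestones to compare are illustrative assumptions, not part of any official incident tooling.

```python
from datetime import datetime

# Timestamp format used for the entries below (12-hour clock with AM/PM).
FMT = "%Y-%m-%d %I:%M %p"

# Milestones copied from the timeline. All times are stated in PST,
# so no timezone conversion is applied here.
impact_start = datetime.strptime("2025-10-20 02:20 AM", FMT)     # Discovery service impact begins
health_restored = datetime.strptime("2025-10-20 09:47 AM", FMT)  # Discovery health checks restored
fully_resolved = datetime.strptime("2025-10-20 12:40 PM", FMT)   # Email backlog cleared, alerts closed


def format_duration(start, end):
    """Return the elapsed time between two datetimes as 'Hh MMm' (hypothetical helper)."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


print("Discovery Service impact:", format_duration(impact_start, health_restored))
print("Full incident window:", format_duration(impact_start, fully_resolved))
```

The first figure covers only the Discovery Service impact; the second covers the full window given in the Time field, including the delayed email notifications.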