[post-mortem] 2025-10-20 07:00 PST Cloud Connectivity Discovery Service Outage

Incident Overview

Date: Oct 20, 2025
Time: 2:20 AM PST - 9:47 AM PST
Service Impacted: Discovery Service
Severity: High
Total Duration: 7 hours and 27 minutes

Customer Impact:

New devices/systems could not establish connections (endpoints not discoverable). Some users observed intermittent “offline” status or brief disconnects while clients retried. Email notifications were delayed but ultimately delivered once queues cleared.

Root Cause

An upstream AWS event, "Operational issue - Multiple services (N. Virginia)", affected EC2 (servers running Discovery Service), DynamoDB (device registration/lookup), and SQS/Lambda (email job queue/processors). Discovery was unavailable for new connections from ~2:20 AM PST to 9:40 AM PST. Full health checks were restored at 9:47 AM PST. SQS/Lambda slowdowns also delayed email notifications until 12:40 PM PST.

For the complete timeline, please see our Incident Page.

The AWS incident impacted our environment in the following ways:

  1. Compute / EC2 launch failures

The EC2 instances became unresponsive and fell out of the cluster. Attempts to restart or replace them failed, and new EC2 instances could not launch due to AWS throttling and launch errors.

  2. DynamoDB DNS resolution issues

DynamoDB, which we use for device registration and lookup, experienced errors and DNS resolution issues. As a result, new-connection workflows remained unstable even after partial compute capacity returned.

  3. SQS polling slowdown

SQS processing slowed due to AWS throttling and polling delays, delaying email notification delivery (timeouts and intermittent upstream errors). The backlog fully drained once SQS normalized.
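Throttling of this kind is typically absorbed by retrying with exponential backoff rather than hammering the degraded service. As a hedged illustration only (not our actual consumer code; `call_with_backoff` and its parameters are hypothetical), a generic backoff wrapper might look like:

```python
import random
import time


def call_with_backoff(fn, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on exceptions with exponential backoff and jitter.

    fn, attempts, and base_delay are illustrative stand-ins, not values
    from our production consumers.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the throttling error
            # Double the delay each attempt, with jitter to avoid
            # synchronized retries from many workers at once.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
```

Injecting `sleep` keeps the wrapper testable and lets callers swap in an event-loop timer.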

How We Fixed It

  • After multiple attempts to launch replacement EC2 instances failed with AWS rate-limiting and capacity errors across several Availability Zones, we pivoted to a workaround: we modified our deployment code on the fly to redeploy the Discovery Service onto a healthy, already-running EC2 host.

  • We updated configuration and permissions, then shifted traffic. This workaround saved several hours and restored service well before AWS fully recovered. New connections were restored at 9:40 AM PST, and our monitoring showed full recovery, with all health checks green, at 9:47 AM PST.

  • To address the email notification delays, we launched additional workers to process the SQS backlog. Delivery fully recovered by 12:40 PM PST, and queues have remained healthy since.
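The idea behind adding workers is simply to drain the same backlog in parallel. A minimal sketch, using an in-process queue and thread pool as stand-ins for SQS and the extra consumers (the names `drain_backlog` and `handler` are hypothetical, not our worker code):

```python
import queue
import threading


def drain_backlog(backlog, handler, worker_count=4):
    """Drain a message backlog with a pool of workers; returns count processed.

    backlog is a queue.Queue standing in for SQS; handler is the
    per-message delivery function (e.g., sending one email notification).
    """
    processed = 0
    lock = threading.Lock()

    def worker():
        nonlocal processed
        while True:
            try:
                msg = backlog.get_nowait()
            except queue.Empty:
                return  # backlog drained: this worker exits
            handler(msg)
            with lock:
                processed += 1
            backlog.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```

With real SQS the workers would long-poll and delete messages after successful handling; the parallel-drain structure is the same.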

Corrective Actions

Short Term:

(Planned) Add an external cache (e.g., MongoDB outside AWS) so discovery lookups continue if AWS services degrade.
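One way to structure such a read-through fallback, sketched with hypothetical injected callables rather than real DynamoDB or MongoDB clients (`discovery_lookup` and its parameters are illustrative, not a committed design):

```python
def discovery_lookup(device_id, primary_lookup, cache_get, cache_put):
    """Resolve a device record, falling back to an external cache.

    primary_lookup stands in for the DynamoDB read; cache_get/cache_put
    stand in for the external cache outside AWS.
    """
    try:
        record = primary_lookup(device_id)
    except Exception:
        # Primary store degraded: serve the (possibly stale) cached record.
        return cache_get(device_id)
    cache_put(device_id, record)  # refresh the cache on every healthy read
    return record
```

The trade-off is that cached records may be stale during an outage, which is acceptable for keeping existing discovery lookups working while registrations are degraded.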

Long Term:

(Planned) Set up a hot-swap standby for Discovery Service in a different region, kept in sync for quick failover.
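The client-side half of such a failover can be as simple as preferring the primary region and only moving to the standby when its health check fails. A minimal sketch, assuming per-endpoint health checks (the endpoint names and `pick_endpoint` helper are hypothetical):

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first healthy endpoint; earlier entries are preferred regions."""
    for ep in endpoints:
        if is_healthy(ep):
            return ep
    raise RuntimeError("no healthy Discovery endpoint available")
```

In practice the health check and ordering would live behind DNS-based failover or a global load balancer, but the selection logic is the same.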