[post-mortem] 2025-11-14 4:10 PM Discovery Service outage

[post-mortem] 2025-11-14 4:10 PM Discovery Service outage

Incident Overview

Date & Time (PST): 2025-11-14 4:10-4:40 PM PST (11:10-11:40 PM UTC)

Service Degradation Duration: 30 minutes
Severity: High
Details: [incident] 2025-11-14 - Discovery Service outage.

Services Impact:

  • Discovery Service

Customer Impact:

  • Users could not connect to their Sites

  • Already connected users were not affected.

Root Cause

ECS agent failed to start on the EC2 when the Discovery service was moved back to its own EC2 instance.

How We Fixed It

The ECS agent on the new EC2 instance created was started.

Corrective Actions

A new script needs to be introduced into image user-data scripts that will check every minute for ECS agent running, and start it if it is not currently running.