[incident + post-mortem] 2024-11-01 12:00 AM - 2024-11-05 12:40 PM Partial Email Service Degradation
Incident Overview:
Date & Time:
Start: 2024-11-01, 12:00 AM
Resolved: 2024-11-05, 12:40 PM
Duration: 4 days, 12 hours, 40 minutes
Service Impacted: Emails sent from VMS (partially)
Severity: Medium
Customer Impact: Emails originating from VMS and sent via Cloud SMTP servers (not AWS) were delivered with significant delays. Certain emails originating between October 31, 12:00 AM and November 1, 6:00 PM were lost. Emails originating from Cloud Portal were not affected.
Investigation Log
2024-11-01 12:45 PM: We began receiving complaints about emails from Cloud Portal not going through, and started investigating and trying to reproduce the issue.
2024-11-04: The issue was reproduced and escalated to the Cloud team for further investigation.
2024-11-05 9:18 AM: The issue was narrowed down to systems using the new feature that allows sending emails through the Cloud SMTP server.
2024-11-05 11:30 AM: The issue was fixed on the infrastructure side by restarting the email service, after which the email queue started processing.
2024-11-05 12:40 PM: The email queue was fully processed; however, certain emails originating between October 31, 12:00 AM and November 1, 6:00 PM were lost.
Root Cause
The incident is likely related to the latest Cloud Portal update. The exact root cause is still unknown.
Corrective Actions
Enhanced Monitoring
The monitoring system will be updated to track issues with emails originating from VMS and processed through Cloud SMTP servers.
Further Investigation
We will examine the email queue and monitor its performance. Enhanced monitoring will help us investigate the root cause and prepare a hotfix.
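As an illustration only, the kind of queue check the enhanced monitoring could run might look like the sketch below. It assumes the email queue can be read as a list of entries with an enqueue timestamp and a source label. The QueuedEmail type, the find_stalled function, the "VMS" and "CloudPortal" labels, and the 30-minute threshold are all hypothetical and are not taken from the actual system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class QueuedEmail:
    # Hypothetical shape of a queue entry; the real queue schema is not known.
    message_id: str
    source: str            # assumed label, e.g. "VMS" or "CloudPortal"
    enqueued_at: datetime


def find_stalled(queue, now, max_age=timedelta(minutes=30), source="VMS"):
    """Return entries from `source` that have waited longer than `max_age`, oldest first."""
    stalled = [
        m for m in queue
        if m.source == source and now - m.enqueued_at > max_age
    ]
    return sorted(stalled, key=lambda m: m.enqueued_at)
```

A monitoring job could call find_stalled on a schedule and raise an alert whenever the result is non-empty, which would flag a stuck queue well before customer complaints arrive.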