[incident + post-mortem] 2024-11-01 12:00 AM - 2024-11-05 12:40 PM Partial Email Service Degradation

Incident Overview:

  • Date & Time:

    • Start: 2024-11-01, 12:00 AM

    • Resolved: 2024-11-05, 12:40 PM

  • Duration: 4 days, 12 hours, 40 minutes

  • Service Impacted: Emails sent from VMS (partially)

  • Severity: Medium

  • Customer Impact: Emails originating from VMS and sent via Cloud SMTP servers (not AWS) were delivered with significant delays. Emails originating between Oct 31 12:00 AM and November 1 6:00 PM were lost. Emails originating from Cloud Portal were not affected.

Investigation Log

  • 2024-11-01 12:45 PM: We began receiving complaints about emails from Cloud Portal not going through, and started investigating and trying to reproduce the issue.

  • 2024-11-04: The issue was reproduced and escalated to the Cloud team for further investigation.

  • 2024-11-05 9:18 AM: The issue was narrowed down to systems using the new feature that allows sending emails through the Cloud SMTP server.

  • 2024-11-05 11:30 AM: The issue was fixed on the infrastructure side by restarting the Email service, after which queued emails started processing.

  • 2024-11-05 12:40 PM: The email queue was fully processed. However, certain emails originating between Oct 31 12:00 AM and November 1 6:00 PM were lost.

Root Cause

The incident is likely related to the latest Cloud Portal update. The exact root cause is still unknown.

Corrective Actions

  1. Enhanced Monitoring
    The monitoring system will be updated to track issues with emails originating from VMS and processed through the Cloud SMTP servers.

  2. Further Investigation
    We will examine the email queue and monitor its performance. The enhanced monitoring should help us investigate the root cause and prepare a hotfix.
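
As an illustration of the kind of check the enhanced monitoring could run, the following is a minimal sketch and not the actual monitoring implementation. It assumes the monitoring system can read the enqueue timestamp of each waiting email; the names `QueueHealth` and `check_queue` and the 30-minute threshold are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class QueueHealth:
    depth: int                       # number of emails waiting in the queue
    oldest_age: Optional[timedelta]  # age of the oldest waiting email, None if empty
    stalled: bool                    # True when an alert should be raised


def check_queue(
    enqueued_at: Sequence[datetime],
    now: datetime,
    max_age: timedelta = timedelta(minutes=30),  # assumed threshold, tune as needed
) -> QueueHealth:
    """Flag the queue as stalled when its oldest email has waited longer than max_age."""
    if not enqueued_at:
        # An empty queue is healthy by definition.
        return QueueHealth(depth=0, oldest_age=None, stalled=False)
    oldest_age = now - min(enqueued_at)
    return QueueHealth(
        depth=len(enqueued_at),
        oldest_age=oldest_age,
        stalled=oldest_age > max_age,
    )


if __name__ == "__main__":
    # Illustrative timestamps only, not data from the incident.
    now = datetime(2024, 1, 1, 12, 0)
    waiting = [datetime(2024, 1, 1, 11, 0), datetime(2024, 1, 1, 11, 55)]
    print(check_queue(waiting, now))
```

Alerting on the age of the oldest waiting email, rather than on queue depth alone, would catch the case where the queue stops draining even while few emails are being submitted.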