[incident + post-mortem] 2024-11-26 06:40 PM - 2024-11-27 02:00 PM (Australia EST) - Relay Server Performance Degradation (Melbourne 1, Australia)

Incident Overview

  • Date & Time:

    • Start: 2024-11-26, 06:40 PM AEST

    • Resolved: 2024-11-27, 2:00 PM AEST

  • Duration: 19 hours, 20 minutes

  • Service Impacted: Connection times for systems routed through the Melbourne 1 Relay Server (~15% of Australian users) increased to as much as 2 minutes.

  • Severity: Low

  • Customer Impact: A small percentage of users experienced longer connection times.

  • Action Required: Update Firewall Passlist configurations and monitoring endpoints according to the https://support.networkoptix.com/hc/en-us/articles/360010795813-Firewall-Passlist article.

Investigation Log Timeline (AEST)

2024-11-25 2:19 PM: We started receiving complaints about significantly increased connection times in Oceania.

2024-11-25 7:07 PM: We narrowed the issue down to the Melbourne 1 Relay Server and started investigating.

2024-11-26 10:50 PM: We noticed that significant service degradation had occurred at 6:50 PM on the Melbourne 1 Relay Server vultr-mel-2.vmsproxy.com (67.219.103.112).

2024-11-27 1:02 AM: We identified the issue and started working on the solution.

2024-11-27 2:00 PM:

  • We deployed two new Relay Servers in the Asia-Pacific region:

    • Sydney, Australia 1 relay-au-syd-1-prod-dp.vmsproxy.com (95.173.193.212)

    • Sydney, Australia 2 relay-au-syd-2-prod-dp.vmsproxy.com (95.173.193.213)

  • We disabled two existing Relay Servers in the Asia-Pacific region:

    • Sydney 4, Australia vultr-syd-4.vmsproxy.com (45.77.51.96)

    • Melbourne 1, Australia vultr-mel-2.vmsproxy.com (67.219.103.112)

  • We restarted the Connection Mediator in that area to re-route traffic to the new Relay Servers.

2024-11-27 2:10 PM: The performance improvement was confirmed. The https://support.networkoptix.com/hc/en-us/articles/360010795813-Firewall-Passlist article has been updated.

Root Cause

The incident was caused by high load on the Melbourne 1 Relay Server.

Corrective Actions

  1. Relay Servers Update
    The issue will be fixed once the Relay Servers in all regions have been updated.

  2. Enhanced Monitoring
    The monitoring system will be updated to track issues related to Relay Server performance.

  3. Required Actions
    Update Firewall Passlist configurations and monitoring endpoints according to the https://support.networkoptix.com/hc/en-us/articles/360010795813-Firewall-Passlist article.
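
After updating the passlist, it can be useful to confirm from inside the affected network that the two new Relay Servers named in this report are reachable. The following is a minimal sketch, not an official tool: the hostnames come from the timeline above, while the port (443), the timeout, and the helper names `check_relay` and `report` are assumptions made for illustration. The Firewall Passlist article remains the authoritative source for hosts and ports.

```python
import socket
import time

# Hostnames are taken from this report.
NEW_RELAYS = [
    "relay-au-syd-1-prod-dp.vmsproxy.com",
    "relay-au-syd-2-prod-dp.vmsproxy.com",
]

# Assumption: HTTPS port. Verify against the Firewall Passlist article.
ASSUMED_PORT = 443


def check_relay(host, port=ASSUMED_PORT, timeout=5.0,
                connect=socket.create_connection):
    """Return the TCP connect time in seconds, or None if the host is unreachable.

    `connect` is injectable so the function can be exercised without a network.
    """
    start = time.monotonic()
    try:
        conn = connect((host, port), timeout)
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return None
    conn.close()
    return time.monotonic() - start


def report(hosts, check=check_relay):
    """Build one human-readable status line per host."""
    lines = []
    for host in hosts:
        elapsed = check(host)
        if elapsed is None:
            lines.append(f"{host}: UNREACHABLE (check the firewall passlist)")
        else:
            lines.append(f"{host}: reachable, TCP connect took {elapsed:.3f} s")
    return lines
```

Running `print("\n".join(report(NEW_RELAYS)))` on a machine behind the firewall lists the status of each new Relay Server. Note that this measures only the raw TCP connect time, which is not the same as the end-to-end connection time users experienced during the incident.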
