Incident Overview
Date & Time:
Start: 2024-11-26, 06:40 PM AEST
Resolved: 2024-11-27, 2:00 PM AEST
Duration: 19 hours, 20 minutes
Service Impacted: Connections to systems routed through the Melbourne 1 Relay Server (~15% of Australian users) took longer to establish (up to 2 minutes).
Severity: Low
Customer Impact: A small percentage of users experienced longer connection times.
Action Required: Update Firewall Passlist configurations and monitoring endpoints according to the https://support.networkoptix.com/hc/en-us/articles/360010795813-Firewall-Passlist article.
Investigation Log Timeline (AEST)
2024-11-26 2:19 PM: We started receiving complaints about significantly increased connection times in Oceania.
2024-11-26 7:07 PM: We narrowed the issue down to the Melbourne 1 Relay Server and started investigating.
2024-11-26 10:50 PM: We noticed significant service degradation at 6:50 PM on the Melbourne 1 Relay Server vultr-mel-2.vmsproxy.com (67.219.103.112).
2024-11-27 1:02 AM: We identified the issue and started working on the solution.
2024-11-27 2:00 PM:
We deployed TWO new Relay Servers in the Asia-Pacific region:
Sydney, Australia 1: relay-au-syd-1-prod-dp.vmsproxy.com (95.173.193.212)
Sydney, Australia 2: relay-au-syd-2-prod-dp.vmsproxy.com (95.173.193.213)
We disabled TWO existing Relay Servers in the Asia-Pacific region:
Sydney 4, Australia: vultr-syd-4.vmsproxy.com (45.77.51.96)
Melbourne 1, Australia: vultr-mel-2.vmsproxy.com (67.219.103.112)
We restarted the Connection Mediator in that area to re-route traffic to the new Relay Servers.
2024-11-27 2:10 PM: The performance improvement was confirmed. The https://support.networkoptix.com/hc/en-us/articles/360010795813-Firewall-Passlist article has been updated.
Root Cause
The incident was caused by high load on the Melbourne 1 Relay Server.
Corrective Actions
Relay Servers Update
The issue will be fixed once all Relay Servers are updated in all regions.
Enhanced Monitoring
The monitoring system will be updated to track issues associated with Relay Server performance.
Required Actions
Update Firewall Passlist configurations and monitoring endpoints according to the https://support.networkoptix.com/hc/en-us/articles/360010795813-Firewall-Passlist article.
UPDATE OVERVIEW
New Relay Servers will be added in each region one by one. Traffic will then be re-routed to the new Relay Servers.
Please add the new IP addresses and FQDNs to your firewall configurations and monitoring endpoints.
Once the traffic is fully re-routed, old Relay Servers will be turned off (see Schedule below).
See https://support.networkoptix.com/hc/en-us/articles/360010795813-Firewall-Passlist for updated IP addresses.
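After updating a passlist, one way to sanity-check it is to confirm that each new Relay Server FQDN resolves to the IP address published in the article. The following is a minimal sketch, not an official tool. The `EXPECTED` table contains only the two Sydney relays named in the incident report above, and `resolve_ipv4`, `check_relays`, and the `resolve` parameter are illustrative names chosen for this example.

```python
import socket
from typing import Callable, Dict, List

# Sydney relays as listed in the incident report above. Extend this table
# from the Firewall Passlist article for other regions (assumption: the
# article lists one IPv4 address per FQDN).
EXPECTED: Dict[str, str] = {
    "relay-au-syd-1-prod-dp.vmsproxy.com": "95.173.193.212",
    "relay-au-syd-2-prod-dp.vmsproxy.com": "95.173.193.213",
}


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the IPv4 addresses the local resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


def check_relays(
    expected: Dict[str, str],
    resolve: Callable[[str], List[str]] = resolve_ipv4,
) -> Dict[str, str]:
    """Map each hostname to 'ok', 'mismatch: ...', or 'unresolved: ...'."""
    report: Dict[str, str] = {}
    for host, ip in expected.items():
        try:
            addresses = resolve(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            report[host] = f"unresolved: {exc}"
            continue
        if ip in addresses:
            report[host] = "ok"
        else:
            report[host] = f"mismatch: got {addresses}"
    return report
```

Calling `check_relays(EXPECTED)` from a host behind the firewall prints nothing by itself; it returns a dictionary that a monitoring job could log or alert on. The resolver is passed in as a parameter so the check can be exercised without network access.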
RELEASE NOTES / SCOPE OF WORK
Update OS to Ubuntu 24.04.
Add the Coturn service (WebRTC-specific traffic relay in Nx WebRTC infrastructure).
Enhance monitoring of AWS hosts and Docker containers.
Update versions of all Docker containers and optimize their Dockerfiles.
Unify all Relay Server settings (except domain names).
Adjust RAM and swap file limits for all Docker containers on Relay Servers.
SCHEDULE
Thursday, Dec 19, 2024 - updating Relay Servers by regions and re-routing traffic:
6:30 AM PST - North America
7:30 AM PST - Asia-Pacific (Friday, Dec 20, 2:30 AM Sydney Time)
8:30 AM PST - Europe (Thursday, Dec 19, 5:30 PM Europe time)
9:30 AM PST - Deprecating old relays via API
9:30 AM PST - Testing
Monday, Dec 23, 2024:
5:30 PM PST - Confirming that the traffic is fully migrated from old Relay Servers
5:40 PM PST - Disabling old Relay Servers
Wednesday, Dec 25, 2024:
4:00 AM PST - Turning off old Relay Servers
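The local times quoted in the schedule can be reproduced with a short conversion. This is a sketch under stated assumptions: it uses fixed UTC offsets that were in effect on Dec 19, 2024 (PST = UTC-8, Sydney daylight time = UTC+11), and it assumes that "Europe time" in the schedule means Central European Time (UTC+1). The names `STARTS_PST` and `to_local` are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets (assumption: no DST change occurs inside the window).
PST = timezone(timedelta(hours=-8), "PST")
SYDNEY = timezone(timedelta(hours=11), "AEDT")  # Sydney daylight time
CET = timezone(timedelta(hours=1), "CET")       # assumed "Europe time"

# Region start times taken from the schedule above, expressed in PST.
STARTS_PST = {
    "North America": datetime(2024, 12, 19, 6, 30, tzinfo=PST),
    "Asia-Pacific": datetime(2024, 12, 19, 7, 30, tzinfo=PST),
    "Europe": datetime(2024, 12, 19, 8, 30, tzinfo=PST),
}


def to_local(start: datetime, tz: timezone) -> str:
    """Format a scheduled start time in the given local time zone."""
    return start.astimezone(tz).strftime("%Y-%m-%d %I:%M %p %Z")


if __name__ == "__main__":
    print("Asia-Pacific:", to_local(STARTS_PST["Asia-Pacific"], SYDNEY))
    print("Europe:", to_local(STARTS_PST["Europe"], CET))
```

With these offsets, the Asia-Pacific start falls on Dec 20 at 2:30 AM in Sydney and the Europe start on Dec 19 at 5:30 PM, matching the schedule.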
DOWNTIME
None
MAINTENANCE LOG
Dec 19th (PST):
6:30 AM - North America update started
7:25 AM - North America update completed
7:30 AM - Asia-Pacific update started
8:25 AM - Asia-Pacific update completed
8:30 AM - Europe update started
9:25 AM - Europe update completed
9:30 AM - Deprecating old relays via API started
9:40 AM - Deprecating old relays via API completed
9:40 AM - Testing started
10:15 AM - Testing successfully ended
RELEASE NOTES
During the last update, we had to roll back DocDB (the service responsible for the Cross-system layouts functionality) because of internal issues discovered during the final test round. The issues have been fixed.
DOWNTIME
No downtime.
MAINTENANCE LOG
Nov 19th:
17:20 - Preparation started
17:40 - DocDB update started
17:43 - DocDB update completed
17:43 - Testing started
18:17 - Testing ended
2024-11-13 12:29 AM: We are currently experiencing issues with the Amsterdam 1 Relay Server relay-dp-ams-1.vmsproxy.com (89.187.174.241).
We do not expect service degradation from the customers' point of view. Other Relay Servers are handling the traffic.
The issue is related to the DataPacket hosting server. We are waiting to get the replacement server from DataPacket.
We will send another update once the replacement server is back online.
2024-11-13 11:07 AM: New server has been provided.
IMPORTANT: New IP and Hostname: https://relay-nl-ams-1-prod-dp.vmsproxy.com (79.127.227.187). Please update your firewall settings and monitoring endpoints.
We will update you once the server is online.
2024-11-13 11:25 AM: New server is up and running. The https://support.networkoptix.com/hc/en-us/articles/360010795813-Firewall-Passlist article has been updated. Please update your firewall settings and monitoring endpoints.
RELEASE NOTES
IMPROVEMENTS
Added support for the upcoming Mobile Client 25.1.
Improved load balancing logic for Connection Mediators.
Improved the logic of selecting the most suitable Relay Server.
BUG FIXES
Cross-system layouts did not show up in the Desktop Client if a user specified capital letters in the Cloud Email address. Fixed.
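The source does not describe how this bug was fixed. As an illustration of the general class of problem only, the sketch below shows a lookup keyed by email address that normalizes case on both write and read, so `User@Example.com` and `user@example.com` map to the same entry. `normalize_email` and `LayoutIndex` are hypothetical names and do not reflect the actual Cloud or DocDB implementation.

```python
from typing import Dict, List


def normalize_email(email: str) -> str:
    """Trim whitespace and fold case so lookups ignore capitalization."""
    return email.strip().casefold()


class LayoutIndex:
    """Hypothetical in-memory index of layout names per account email."""

    def __init__(self) -> None:
        self._by_owner: Dict[str, List[str]] = {}

    def add(self, owner_email: str, layout_name: str) -> None:
        # Normalize on write so stored keys are always lower case.
        key = normalize_email(owner_email)
        self._by_owner.setdefault(key, []).append(layout_name)

    def layouts_for(self, email: str) -> List[str]:
        # Normalize on read so the caller's capitalization does not matter.
        return list(self._by_owner.get(normalize_email(email), []))
```

Normalizing at both ends keeps the stored keys canonical, which avoids the failure mode where an address entered with capital letters never matches the stored record.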
DOWNTIME
No downtime. Current connections MAY be affected for VMS versions earlier than 5.1.3. Users might need to log back in to their systems.
MAINTENANCE LOG
Nov 14th:
17:00 - Preparation started
17:40 - DocDB and US-EAST1 Mediator update started
17:50 - DocDB and US-EAST1 Mediator update completed
17:52 - Update of the remaining Mediators started (region by region)
18:29 - Mediators update completed
18:30 - Testing started
18:55 - Testing ended
18:55 - Bug found in DocDB
18:55 - Manual Testing
19:10 - Decision made to roll back the DocDB service
19:12 - Rollback changes applied
19:19 - Rollback for DocDB completed
RELEASE NOTES
BUG FIXES
Streams from cameras in 5.1.x systems could not be displayed on Cloud Portal (View tab). Fixed.
Internal fixes for the upcoming Channel Partners feature.
DOWNTIME
There might be up to 4 minutes of downtime for the following services:
Cloud Portal
Email Notifications
Push Notifications
Cloud connectivity will not be affected.
MAINTENANCE LOG
17:20 - Preparation started
17:40 - Cloud Portal update started
17:56 - 17:57 - Cloud Portal and Push Notifications unavailable (downtime)
18:06 - Cloud Portal update completed
18:07 - Testing started
18:18 - Testing successfully ended
Incident Overview
Date & Time:
Start: 2024-11-01, 12:00 AM
Resolved: 2024-11-05, 12:40 PM
Duration: 4 days, 12 hours, 40 minutes
Service Impacted: Emails sent from VMS (partially)
Severity: Medium
Customer Impact: Emails originating from VMS via Cloud SMTP servers (not AWS) were sent with significant delays. Emails originating between Oct 31, 12:00 AM and Nov 1, 6:00 PM were lost. Emails originating from Cloud Portal were not affected.
Investigation Log
2024-11-01 12:45 PM: We started receiving complaints about Emails from Cloud Portal not going through. We began investigating and trying to reproduce the issue.
2024-11-04: The issue was reproduced and escalated to the Cloud team for further investigation.
2024-11-05 9:18 AM: The issue was narrowed down to systems using the new feature that allows sending Emails through the Cloud SMTP server.
2024-11-05 11:30 AM: The issue was fixed on the infrastructure side by restarting the Email service, after which the Email queue started processing.
2024-11-05 12:40 PM: The Email queue was fully processed; however, certain Emails originating between Oct 31, 12:00 AM and Nov 1, 6:00 PM were lost.
Root Cause
The incident is likely related to the latest Cloud Portal update. The exact root cause is still unknown.
Corrective Actions
Enhanced Monitoring
The monitoring system will be updated to track issues associated with Emails originating from VMS and processed through Cloud SMTP servers.
Further Investigation
We will examine the Email queue and monitor its performance. Enhanced monitoring will help us investigate the root cause and prepare a hotfix.
Summary
We identified an ongoing issue with the Cloud Portal: streams from cameras on 5.1.x systems cannot be displayed.
Investigation Log
2024-10-31: After the recent Cloud Portal update, we started receiving reports of issues with playing back camera streams on Cloud Portal.
2024-11-01: After closer investigation, we found that the issues occurred only on some 5.1.x systems.
2024-11-04: We confirmed that the issue affects all customers on 5.1.x systems and that 6.0 systems are working fine.
Corrective Actions
We are preparing a hotfix to be deployed within a few days. We will send an update and schedule the hotfix once testing is complete.
We will follow up with the Post-Mortem / Root Cause Analysis.