SiQ Detailed Root Cause Analysis – Severity 1 – 10/17/2023
Inability to Access SpaceIQ
We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Description:
On October 17, 2023, at approximately 3am EDT, customer support began receiving reports of customers being unable to access the SpaceIQ platform. When attempting to access the platform, customers were presented with a 503 Service Unavailable error, which disrupted the user experience.
Type of Event:
Severity 1 Outage
Services/Modules Impacted:
Production
Remediation:
The database migration for the deployment of the October 11, 2023 release was rolled back to resolve the incident.
Timeline:
October 17, 2023 at approximately 3am EDT - Support began to receive reports from customers of an inability to access the SpaceIQ platform. When attempting to access the platform, customers were presented with a 503 Service Unavailable error. No updates were made to the status page because the issue was reported outside of regular support hours (Monday through Friday, 6am – 5pm MST). After a short investigation, we learned that the release caused the disruption. Engineering quickly identified the issue, and the decision was made to roll the release back. The impact of this disruption lasted approximately 1 hour and 15 minutes.
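As a rough illustration of why the first reports fell outside support coverage, the sketch below converts the approximate report time to MST and checks it against the stated support hours. It assumes fixed UTC offsets for "EDT" (UTC-4) and "MST" (UTC-7) exactly as the report labels them; the function and constant names are hypothetical and do not describe any SpaceIQ tooling.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: fixed offsets matching the labels used in this report.
EDT = timezone(timedelta(hours=-4), "EDT")
MST = timezone(timedelta(hours=-7), "MST")

# Regular support hours as stated: Monday through Friday, 6am to 5pm MST.
SUPPORT_OPEN = time(6, 0)
SUPPORT_CLOSE = time(17, 0)


def in_support_hours(moment: datetime) -> bool:
    """Return True if a timezone-aware moment falls inside regular support hours."""
    local = moment.astimezone(MST)
    is_weekday = local.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and SUPPORT_OPEN <= local.time() < SUPPORT_CLOSE


# Approximate time of the first customer reports.
first_report = datetime(2023, 10, 17, 3, 0, tzinfo=EDT)

print(first_report.astimezone(MST).strftime("%H:%M %Z"))  # 00:00 MST
print(in_support_hours(first_report))                     # False
```

Under these assumptions, 3am EDT corresponds to midnight MST, roughly six hours before support opens.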
Total Duration of Event:
1hr 15mins
Root Cause Analysis:
The October 11th release was rolled out on 10/16/2023 at 11:30 pm. The backend database was deployed at the same time as the frontend database. Usually, the two deployments finish around the same time. In this case, the backend deployment took longer than expected, so the frontend deployment finished first and began calling for backend data while the backend rollout was still in progress. This caused the 503 errors that customers experienced.
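One common way to avoid this kind of race is to gate the frontend deployment on a backend readiness check rather than starting both at once. The following is a minimal sketch of that idea, not a description of the actual SpaceIQ deployment pipeline; every function and parameter name here (deploy_release, backend_ready, rollback_backend, and so on) is hypothetical.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or timeout_s seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


def deploy_release(
    deploy_backend: Callable[[], None],
    backend_ready: Callable[[], bool],
    deploy_frontend: Callable[[], None],
    rollback_backend: Callable[[], None],
    timeout_s: float = 1800.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Deploy the backend first and release the frontend only once it is ready."""
    deploy_backend()
    if not wait_until_ready(backend_ready, timeout_s, interval_s, sleep):
        # Backend did not become ready in time: undo it and leave the
        # existing frontend in place instead of serving errors.
        rollback_backend()
        return "rolled_back"
    deploy_frontend()
    return "deployed"
```

With a gate like this, a slow backend migration delays the frontend rollout or triggers a rollback, instead of leaving a new frontend calling a backend that is not yet available.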
Preventative Action:
Internal teams will collaborate to ensure that maintenance windows are scheduled whenever the size of a deployment creates potential for downtime. We are also discussing moving the timing of release deployments to ensure that support coverage is available. Support will also be launching a new 24x5 Incident Management Process, set to go live in the coming weeks, to ensure this does not happen again. This process will cover Sev 1 and Sev 2 incidents. All other incidents (Sev 3 – Sev 5) will go through the normal channels and process for turnaround timeframes.