S1 - Inability to Access SpaceIQ
Incident Report for SiQ
Postmortem

SiQ Detailed Root Cause Analysis – Severity 1 – 10/17/2023 

Inability to Access SpaceIQ  

We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident. 

 

Description: 

On October 17, 2023, at approximately 3am EDT, customer support began receiving reports that customers were unable to access the SpaceIQ platform. When attempting to access the platform, customers were presented with a 503 Service Unavailable error. This disrupted the user experience.

Type of Event: 

Severity 1 Outage 

 

Services/Modules Impacted:

Production 

 

Remediation: 

The database migration for the deployment of the October 11, 2023 release was rolled back to correct the incident.

 

Timeline: 

October 17, 2023 at approximately 3am EDT - Support began to receive reports that customers were unable to access the SpaceIQ platform. When attempting to access the platform, customers were presented with a 503 Service Unavailable error. No updates were made to the status page because the issue was reported outside of regular support hours (Monday through Friday, 6am – 5pm MST). After a short investigation, we learned that the release caused the disruption. Engineering quickly identified the issue, and the decision was made to roll the release back. The impact of this disruption lasted approximately 1 hour and 15 minutes.

 

Total Duration of Event:

1hr 15mins 

 

Root Cause Analysis: 

The October 11 release was rolled out on 10/16/2023 at 11:30 pm. The backend database was deployed at the same time as the frontend database. The backend deployment took longer than expected, so the frontend deployment finished first and began requesting backend data while the backend rollout was still in progress. Usually, the two deployments finish at around the same time. This caused the 503 errors that customers experienced.
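For illustration only, the ordering problem described above can be avoided by gating the frontend rollout on a backend readiness check. The following is a minimal sketch, not a description of the actual SiQ deployment tooling: the names wait_until_ready and deploy_in_order, and the callables passed to them, are hypothetical.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def deploy_in_order(
    deploy_backend: Callable[[], None],
    backend_ready: Callable[[], bool],
    deploy_frontend: Callable[[], None],
    **wait_kwargs,
) -> bool:
    """Deploy the backend, wait for it to report ready, then deploy the frontend.

    Returns False, without touching the frontend, if the backend never
    becomes ready within the timeout. All three callables are hypothetical
    stand-ins for real deployment steps.
    """
    deploy_backend()
    if not wait_until_ready(backend_ready, **wait_kwargs):
        return False
    deploy_frontend()
    return True
```

With this kind of gate, a backend deployment that runs longer than expected delays the frontend rollout instead of letting the frontend call a backend that is not yet available.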

 

Preventative Action:  

Internal teams will collaborate to ensure that maintenance windows are scheduled when we expect potential downtime due to the size of a deployment. We are also discussing moving the timing of release deployments to ensure that support coverage is available. Support will also be launching a new 24x5 Incident Management Process, set to go live in the coming weeks, to ensure this does not happen again. This process will cover Sev 1 and Sev 2 incidents. All other incidents (Sev 3 – Sev 5) will go through the normal channels and process for turnaround timeframes.

Posted Oct 27, 2023 - 19:49 UTC

Resolved
On October 17, 2023, at approximately 3am EDT, customer support began receiving reports that customers were unable to access the SpaceIQ platform. When attempting to access the platform, customers were presented with a 503 Service Unavailable error. This disrupted the user experience.
Posted Oct 17, 2023 - 09:00 UTC