S1 - Inability to Access SpaceIQ
Incident Report for SiQ
Postmortem

SiQ Detailed Root Cause Analysis – Severity 1 – November 3rd, 6th, & 7th 

Inability to Access the SpaceIQ Platform 

 

We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident. 

 

Description: 

On Friday, November 3, 2023, internal and external customers began to report that they could not access the SpaceIQ platform. When users tried to log in, they were presented with the error message “Web server is down”.

 

Type of Event: 

Severity 1 - Outage 

 

Services\Modules Impacted: 

Production  

 

Remediation: 

A new node group with a higher proportion of on-demand instances was configured to increase the availability of the underlying infrastructure powering the application. Services were then restarted to correct the disruption.
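
As a rough illustration of why the on-demand share of a node group matters, the sketch below computes how many nodes would remain if every instance that is not on-demand were lost at the same moment. The function name, the node count, and the percentages are hypothetical and are not taken from the SpaceIQ configuration; the real node group may be sized and balanced quite differently.

```python
def surviving_nodes(total_nodes: int, on_demand_percent: int) -> int:
    """Return the number of nodes backed by on-demand capacity.

    This is the worst-case capacity left if all other instances in the
    node group become unavailable at once. Values are illustrative only.
    """
    if total_nodes < 0:
        raise ValueError("total_nodes must be non-negative")
    if not 0 <= on_demand_percent <= 100:
        raise ValueError("on_demand_percent must be between 0 and 100")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (total_nodes * on_demand_percent + 99) // 100


# Hypothetical 10-node group, before and after raising the on-demand share.
low_share = surviving_nodes(10, 20)   # 2 nodes remain in the worst case
high_share = surviving_nodes(10, 80)  # 8 nodes remain in the worst case
print(low_share, high_share)
```

Under these assumed numbers, raising the on-demand share from 20% to 80% lifts the guaranteed floor from 2 nodes to 8, which is the kind of availability gain the remediation aims for.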

 

Timeline: 

On Friday, November 3, 2023, at approximately 12:18pm EST, an initial report from a customer mentioned their inability to access the SpaceIQ platform. When their users tried to log in, they were presented with the error message “Web server is down”. As the investigation continued, additional customers began to report the issue, and by 1:25pm EST all customers were notified of the Severity 1 disruption via the status page. By 2:00pm EST, internal teams had identified the root cause and begun to research the fix for the disruption. At 2:16pm EST, internal and external customers began to report that they could access SpaceIQ. At approximately 6:08pm EST, with no additional reports received, the status page was updated from monitoring to resolved.

 

On Monday, November 6, 2023, at approximately 4:24pm EST, external customers began to report the same disruption as on Friday: an inability to access the SpaceIQ platform. All customers were notified of the Severity 1 disruption via the status page at 4:52pm EST. At approximately 5:28pm EST, internal and external customers began reporting that they could access SpaceIQ again, and the status page was updated from investigating to monitoring. After no additional reports were received and customers confirmed their ability to access the platform, the status page was marked as resolved at 8:44pm EST.

 

On Tuesday, November 7, 2023, at approximately 9:39am EST, an initial report to support mentioned an intermittent inability to access the SpaceIQ platform. Internal teams were notified and began investigating. At 10:38am EST, internal users could access the platform; however, at approximately 10:47am EST, additional customers began to report the intermittent disruption, and all customers were made aware of the Severity 2 disruption via the status page. Internal teams continued to investigate. At 5:30pm EST, internal teams identified and implemented the fix. Monitoring continued through Wednesday, November 8, 2023. As customers confirmed the fix and no additional reports of disruptions were made, the status page was marked as resolved at 11:05am EST.

 

Total Duration of Event: 

Friday, November 3, 2023 – 3 hours 24 minutes 

Monday, November 6, 2023 – 1 hour 4 minutes  

Tuesday, November 7, 2023 – 7 hours 53 minutes 
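
The duration arithmetic can be checked with a short sketch. It assumes a duration is measured from the first customer report to the first reports of restored access, which reproduces the Monday figure (4:24pm to 5:28pm EST); the report does not state which endpoints were used for the other two days, so only Monday is shown. The helper name `elapsed` and the timestamp format are illustrative.

```python
from datetime import datetime


def elapsed(start: str, end: str, fmt: str = "%Y-%m-%d %I:%M%p") -> tuple[int, int]:
    """Return the time between two same-zone timestamps as (hours, minutes)."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    total_minutes = int(delta.total_seconds() // 60)
    return divmod(total_minutes, 60)


# Monday, November 6, 2023: first report to first reports of restored access.
hours, minutes = elapsed("2023-11-06 4:24pm", "2023-11-06 5:28pm")
print(f"{hours} hour {minutes} minutes")  # 1 hour 4 minutes
```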

 

Root Cause Analysis: 

Our recent outage was caused by an underlying configuration issue with our cloud hosting provider. Our infrastructure was over-reliant on instances that could be unstable during scaling events. 

 

Preventative Action:  

We have allocated resources toward preventative action, and our teams will continue to review internal processes and leverage appropriate resources to avoid this sort of disruption in the future.

Posted Nov 15, 2023 - 22:31 UTC

Resolved
This incident has been resolved.
Posted Nov 03, 2023 - 20:08 UTC
Identified
The issue with SpaceIQ has been identified and a fix is being implemented. We are waiting on feedback from a third-party vendor.

The next update will be posted in 2 hours per Eptura published guidelines https://eptura.com/terms/sla/
Posted Nov 03, 2023 - 18:16 UTC
Investigating
We are currently aware of an issue with SpaceIQ. Our Engineering team is investigating to determine the cause of the disruption.

The next update will be posted in 2 hours per Eptura published guidelines https://eptura.com/terms/sla/
Posted Nov 03, 2023 - 17:59 UTC
This incident affected: System Status.