SiQ Detailed Root Cause Analysis – Severity 1 & 2 – April 20, 25 & 28, 2023
Intermittent Latency Using SpaceIQ Platform
Description:
On April 20, 2023, at approximately 2:40pm, customer support began receiving reports of issues throughout the SpaceIQ platform. When customers tried to navigate the platform, error messages such as “Timeout” and “Network Failed Request” appeared, customers experienced latency, and at one point customers were unable to access SpaceIQ at all. Similar reports followed the next week, on April 25 and again on April 28.
Type of Event:
Performance Degradation
Services/Modules Impacted:
Production
Timeline:
April 20, 2023 |S1| SIQDEV1-5021
On Thursday, April 20, 2023, at approximately 2:40pm MST, customer support began receiving reports of issues throughout the SpaceIQ platform. When customers tried to navigate the platform, error messages such as “Timeout” and “Network Failed Request” appeared, customers experienced latency, and at one point customers were unable to access SpaceIQ at all. At 3:15pm MST, the SpaceIQ engineering team acknowledged the issue and began an initial investigation; at the same time, SpaceIQ customers were notified via the SpaceIQ Status Page that we were actively investigating. At 3:21pm MST, the engineering team began restarting internal services to mitigate the performance issues. Around 8:47pm MST, the product team began verifying platform stability, and at 8:57pm MST it confirmed that SpaceIQ performance had improved. At 9:33pm MST, the Status Page was updated from Investigating to Monitoring. We continued to monitor the platform through Friday morning, April 21, 2023; once customers confirmed that performance had returned to a normal level, the Status Page was updated from Monitoring to Resolved at 6:15am MST.
Total Duration of Event: 15HRS 35MINS
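For context on the mitigation used on each of these days, the sketch below illustrates one way a health-check-and-restart loop for an internal service can be implemented. It is a minimal, hypothetical example only: the health endpoint, latency budget, and systemd service name are assumptions made for illustration and do not reflect SpaceIQ's actual internals.

    # Hypothetical health-check-and-restart mitigation loop.
    # The endpoint URL, thresholds, and service name below are
    # illustrative assumptions, not SpaceIQ's actual configuration.
    import subprocess
    import time
    import urllib.request

    HEALTH_URL = "http://localhost:8080/health"  # assumed health endpoint
    LATENCY_BUDGET_S = 2.0                       # assumed acceptable response time
    CHECK_INTERVAL_S = 30

    def service_is_healthy() -> bool:
        """Return True if the service answers its health check within budget."""
        start = time.monotonic()
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                ok = resp.status == 200
        except OSError:  # timeouts and network errors count as unhealthy
            return False
        return ok and (time.monotonic() - start) < LATENCY_BUDGET_S

    def restart_service() -> None:
        """Restart the degraded internal service (systemd is assumed here)."""
        subprocess.run(["systemctl", "restart", "internal-service"], check=True)

    if __name__ == "__main__":
        while True:
            if not service_is_healthy():
                restart_service()
            time.sleep(CHECK_INTERVAL_S)

In the incidents described here, the equivalent restarts were performed manually by the engineering team once degradation was confirmed.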
April 25, 2023 |S2| SIQDEV1-5092
On Tuesday, April 25, 2023, at approximately 3:06pm MST, customer support received an initial report of intermittent latency while using SpaceIQ. By 4:56pm MST, multiple reports had come in and the support team was able to replicate the latency; all SpaceIQ customers were made aware of the intermittent latency through the Status Page, which was set to Investigating. By 6:40pm MST, the Status Page was updated from Investigating to Identified. At 7:36pm MST, the engineering team began restarting internal services to mitigate the intermittent latency, and by 7:48pm MST the engineering team was seeing better performance and multiple customers also reported improved platform performance. The Status Page was set to Monitoring for the next hour; no additional reports were made, and the Status Page was marked Resolved at 9:02pm MST.
Total Duration of Event: 5HRS 56MINS
April 28, 2023 |S2| SIQDEV1-5102
On Friday, April 28, 2023, at approximately 3:29am MST, customer support received an initial report of intermittent latency and errors while using SpaceIQ. By 9:56am MST, two additional customers had reported the same behavior; the engineering team was made aware of the reports and began investigating. At 12:05pm MST, additional customers began reporting the same behavior, and all customers were made aware of the intermittent latency and errors through the Status Page, which was set to Investigating. At 3:37pm MST, the engineering team began restarting internal services to mitigate the intermittent latency. The engineering team began seeing better performance, multiple customers also reported improved platform performance, and the Status Page was updated from Investigating to Monitoring. During the monitoring period no additional reports were made, and customers reported that SpaceIQ stability had returned to normal. At 6:05pm MST, the Status Page was updated from Monitoring to Resolved.
Total Duration of Event: 14HRS 36MINS
Root Cause Analysis:
Across the three days on which customers experienced this performance degradation, the engineering team determined that all of the incidents stemmed from the same root cause: a single internal service that failed repeatedly on each of those days.
Preventative Action:
The engineering team investigated each incident and continued to diagnose and improve the failing internal service. A reconfiguration of that service was ultimately implemented to correct the slowness.
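Beyond the reconfiguration itself, degradation of this kind can often be caught earlier with a synthetic latency probe that alerts before customer reports accumulate. The sketch below is a minimal, hypothetical example of such a probe; the probe URL, window size, threshold, and alert hook are all illustrative assumptions rather than SpaceIQ's actual monitoring configuration.

    # Hypothetical synthetic latency probe that alerts on sustained slowness.
    # The probe URL, window size, threshold, and alert hook are assumptions
    # for illustration, not SpaceIQ's actual monitoring stack.
    import collections
    import statistics
    import time
    import urllib.request

    PROBE_URL = "https://app.example.com/api/ping"  # assumed check endpoint
    ALERT_THRESHOLD_S = 1.5                         # assumed latency budget
    PROBE_INTERVAL_S = 30
    window = collections.deque(maxlen=10)           # rolling window of timings

    def probe_once() -> float:
        """Time one request; count timeouts and network errors as degraded."""
        start = time.monotonic()
        try:
            with urllib.request.urlopen(PROBE_URL, timeout=10):
                pass
        except OSError:
            return float("inf")
        return time.monotonic() - start

    def alert(median_latency: float) -> None:
        """Placeholder alert hook; a real system would page on-call here."""
        print(f"ALERT: median probe latency {median_latency:.2f}s exceeds budget")

    if __name__ == "__main__":
        while True:
            window.append(probe_once())
            if len(window) == window.maxlen:
                median = statistics.median(window)
                if median > ALERT_THRESHOLD_S:
                    alert(median)
            time.sleep(PROBE_INTERVAL_S)

Alerting on a rolling median rather than a single slow request avoids paging on one-off network blips while still catching the sustained intermittent latency seen in these incidents.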