Site Disruption Event - RCA

The following Root Cause Analysis (RCA) was performed as a result of an issue impacting customers using the ManagerPlus application on April 20, 2020.


Background
At 7:20 AM MDT on 4/20/2020, the ManagerPlus support team began to receive multiple customer reports showing a pattern of reduced performance and unavailability. Support contacted the DevOps team and received confirmation that alerting systems were corroborating the customer reports. Due to the critical nature of these alerts, the DevOps team initiated an investigation at 7:28 AM MDT and began remediation efforts shortly afterward.

The unavailability issue led to an intermittent S1-S2 severity event, resulting in a significant number of customers being unable to perform critical business operations until resolution.


Investigation & Cause
At 7:28 AM MDT on 4/20/2020, we began investigating the internal alerts and customer reports in order to resolve the issue as quickly as possible. The disruption resulted from a cascading failure of non-optimized servers, which caused errors in processing requests and rendered the application unusable. After numerous attempts to roll the systems back to a stable configuration, the DevOps team determined that the only path to sustainable recovery was to rebuild the application servers. Remediation efforts began at 7:35 AM MDT on 4/20/2020 and were completed at 2:30 AM MDT on 4/21/2020.


Remediation 
At 7:35 AM MDT on 4/20/2020, within 15 minutes of the initial report, the DevOps team began to cycle the application pools in an attempt to bring CPU load into the normal range. The result was sporadic performance and additional alerting from our monitoring tools. At 8:42 AM MDT, remediation continued with rebooting systems in a staggered fashion, resulting in moderate stability being reported by 10:07 AM MDT. Unfortunately, internal monitoring showed additional errors at 11:38 AM MDT, after which customers continued to experience sporadic performance and intermittent availability.

From 1:10 PM to 5:32 PM MDT, the DevOps team continued recovery efforts by systematically reverting configuration changes to previous states. By 7:59 PM MDT, it was clear that the recovery operations were not going to be sustainable in the long term. As a result, ManagerPlus initiated a plan to rebuild the application servers, a process which was completed at 2:30 AM MDT on 4/21/2020.

The server rebuild process was successful, and full application availability was restored at 2:30 AM MDT on 4/21/2020, ending the S1-S2 severity event.


Communication
At 7:20 AM MDT on 4/20/2020, customer support began reporting inbound customer calls related to this issue via designated internal channels. The DevOps team immediately picked up these reports and provided prompt updates for communication back to impacted customers. Unfortunately, due to the size and scope of the outage, we experienced some gaps in our direct communication to customers that are addressed in the Preventative Action and Analysis section.

At 12:00 PM MDT, as soon as it was clear that performance issues were going to be sustained, a message was added to the Login screen to communicate to customers that we were aware of the ongoing issue. Similar language was added to www.managerplus.com later in the afternoon as a second method of communication in the event customers could not reach the Login screen. These updates remained available to customers for the duration of the issue.


Preventative Action and Analysis
At 2:30 AM MDT on 4/21/2020, ManagerPlus successfully restored availability and ended the S1-S2 severity event for all impacted customers. Based on internal monitoring tools and testing, this event is rated as an S1-level event for 1 hour and 18 minutes, and as an S2-level event for 17 hours and 48 minutes. However, because recovery was intermittent, we are aware that some customers may have experienced more or less impact than these ratings suggest, depending on their time of access and other related factors as we worked to recover.
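As a rough cross-check of the durations above, the following minimal sketch adds the rated S1 and S2 durations and compares the total with the time elapsed between the first customer reports and full restoration. The timestamps and durations come from this report. The variable names, the `fmt` helper, and the assumption that the S1 and S2 windows do not overlap are illustrative only. The report does not state when the rated window began, so the small difference between the two figures is not explained here.

```python
from datetime import datetime, timedelta

# Timestamps from this report (both MDT, so no timezone conversion is needed).
first_report = datetime(2020, 4, 20, 7, 20)
restored = datetime(2020, 4, 21, 2, 30)

# Rated durations from this report.
s1_duration = timedelta(hours=1, minutes=18)
s2_duration = timedelta(hours=17, minutes=48)

# Assumption: the S1 and S2 windows do not overlap, so they can be summed.
rated_total = s1_duration + s2_duration
elapsed = restored - first_report


def fmt(delta):
    """Format a timedelta as hours and minutes (hypothetical helper)."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


print("Rated S1 + S2 duration:", fmt(rated_total))
print("First report to restoration:", fmt(elapsed))
```

Under these assumptions, the rated total is 19 hours 6 minutes, and the elapsed time from the first report to restoration is 19 hours 10 minutes.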

As a direct result of this event, ManagerPlus is taking a number of actions internally to reduce the likelihood of a similar event occurring in the future. The initial steps were taken as part of the remediation efforts, when we rebuilt our application servers. This project will continue as we work to optimize resource availability and build resiliency into our infrastructure.

Additionally, while our internal alerts did work as intended to notify ManagerPlus teams of the issue, we have determined that the scope of the monitoring was inadequate to allow for proactive response to an event of this nature. Consequently, the DevOps team is performing a formal review of all existing alerts and adding alerts to areas that were underreported during this event.

Lastly, this event identified gaps in our customer communication strategy, with some customers waiting far too long for an update on the issue. As a result, our support team is working closely with the marketing and DevOps teams to develop a strategy for reporting downtime events in a more proactive manner. This new strategy should allow us not only to alert customers to issues as they occur but also to provide channels for ongoing, regular communication until these issues are fully resolved.

We anticipate that these actions will reduce the potential for similar issues going forward and increase our overall capacity to respond should we encounter a similar event in the future. 

 

 

 
