The following Root Cause Analysis (RCA) was performed as a result of an issue impacting a subset of customers using the ManagerPlus mobile application on April 22, 2020.
At 6:00 AM MDT on 4/22/2020, the ManagerPlus support team began to receive reports from a subset of customers regarding issues with the mobile application, specifically related to the mobile app being unable to sync with the desktop application. These reports increased in volume over the next two hours, and as a result, the Engineering team initiated an investigation at 7:59 AM MDT.
An issue of this nature would typically be treated as an S3 severity event due to its limited scope; however, due to the S1-S2 event earlier in the week, the decision was made to investigate the issue as an S2 event until resolution, raising the priority level for all teams involved in remediation.
Investigation & Cause
At 7:59 AM MDT on 4/22/2020, the Engineering team began investigating the customer reports of the syncing issue. By 8:55 AM MDT, internal teams had traced the cause to an incomplete ACL configuration on the servers rebuilt during the S1-S2 event on 4/20/2020 (detailed in the RCA released for that event). As a result, update requests from the mobile app were denied for the impacted customers.
At 8:56 AM MDT on 4/22/2020, within minutes of identifying the cause, the Engineering team began correcting the ACL configuration to restore communication between the mobile and desktop applications. By 9:50 AM MDT, remediation was complete and communication was restored.
With mobile app communication restored, the S2-S3 severity event was officially closed at 9:50 AM MDT on 4/22/2020.
At 6:00 AM MDT on 4/22/2020, customer support began receiving inbound customer reports related to this issue via designated internal channels; however, because the impact was limited to a subset of customers, the issue was not escalated until 7:59 AM MDT. This timeline is in line with standard processes for these types of reports. Upon escalation, the Engineering team immediately picked up these reports, initiated an expedited investigation, and resolved the issue within two hours of receiving the escalated reports (7:59 AM to 9:50 AM MDT). Internal communication processes worked as expected from the initial report to resolution.
Preventative Action and Analysis
At 9:50 AM MDT on 4/22/2020, ManagerPlus successfully restored availability to the subset of customers impacted by the S2-S3 mobile syncing issue. Based on customer reports and overall scope, this issue would typically be classified as an S3-level event; however, the determination was made to elevate its treatment to an S2-level event to ensure remediation was prioritized internally. During the formal post-mortem for this event, it was determined that the issue was an unfortunate consequence of changes made to recover the main application during the S1-S2 event on 4/20/2020. These changes, although necessary to recover the application, highlighted a gap in our emergency change control procedures.
As a direct result, ManagerPlus is reviewing and updating our internal documentation to address the gap in our server configuration procedure, reducing the likelihood of a similar event occurring as a result of a missed configuration step. Additionally, the Engineering team is adding this event to the ongoing review of existing alerts with the goal of identifying areas where we can increase our responsiveness and resiliency going forward.
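As an illustration of how a missed configuration step of this kind might be caught after a server rebuild, the following is a minimal sketch that compares a server's ACL against an expected baseline. It is not the ManagerPlus procedure or tooling: the `AclRule` type, the `missing_rules` helper, and the rule names ("mobile-app", "desktop-app", "sync") are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclRule:
    """One ACL entry. All field values used below are hypothetical."""
    source: str    # client label, e.g. "mobile-app" (assumed name)
    action: str    # operation, e.g. "sync" (assumed name)
    allowed: bool  # whether the operation is permitted


def missing_rules(baseline, actual):
    """Return the baseline rules that are absent from the actual ACL."""
    actual_set = set(actual)
    return [rule for rule in baseline if rule not in actual_set]


# Hypothetical baseline that every rebuilt server is expected to satisfy.
BASELINE = [
    AclRule("desktop-app", "sync", True),
    AclRule("mobile-app", "sync", True),
]

# Hypothetical state of a rebuilt server where the mobile rule was omitted.
rebuilt_server_acl = [AclRule("desktop-app", "sync", True)]

for rule in missing_rules(BASELINE, rebuilt_server_acl):
    print(f"MISSING: {rule.source} -> {rule.action} (allowed={rule.allowed})")
```

Run as the final step of a rebuild checklist, a check along these lines would flag the omitted mobile rule before customers are affected, rather than relying on inbound support reports.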