Results Processing
Incident Report for Rigor
Postmortem

At 8:41 am ET on Wednesday, Nov 25th, Rigor detected a problem with our service and alerted our operations team to the issue. The root cause was an outage of the Amazon Web Services (AWS) Kinesis service in the US-EAST-1 region (https://status.aws.amazon.com/). Kinesis powers a number of Amazon services, including CloudWatch, which Rigor relies on to monitor its services.

This had a knock-on effect on our scheduler and results processing services, which were unable to log their results to CloudWatch. Without CloudWatch data, our engineers also had limited visibility into our system. Rigor deployed a fix to work around the AWS issue (which was still ongoing) at 11:39 am, and the team began bringing all services back to full availability. Unfortunately, this outage meant that runs between approximately 10:17 am ET and 3:47 pm ET were not executed, and users may have seen a few duplicate runs scheduled as the system returned to normal.

In the short term, the team is making permanent the change that removes CloudWatch as a dependency. Longer term, we are investigating other points of critical dependency so that we can add circuit breakers and better safeguard against future third-party outages.

Thank you for your patience and your trust in Rigor. We know this is an important week for many of you and we are monitoring the situation very closely.
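
For readers unfamiliar with the circuit breaker pattern mentioned above, the following is a minimal, generic sketch in Python. It is an illustration only and does not represent Rigor's actual code or the hotfix that was deployed. The names CircuitBreaker, process_result and publish_metric are hypothetical, as are the thresholds. The idea is that a non-critical call (here, publishing a metric) is skipped for a cool-down period after repeated failures, so that it cannot hold up the critical path.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Skips calls to a failing dependency until a cool-down period has passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds (assumed value)
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the cool-down has elapsed, allow a trial call (half-open state).
        return self.clock() - self.opened_at < self.reset_after

    def call(self, func: Callable[[], None]) -> bool:
        """Run func unless the breaker is open. Returns True if func succeeded."""
        if self.is_open():
            return False
        try:
            func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return False
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return True


def process_result(result: dict, publish_metric: Callable[[dict], None],
                   breaker: CircuitBreaker) -> dict:
    # The critical work (storing the result, sending alerts) would happen here.
    processed = {**result, "processed": True}
    # Metrics are best effort: a failure here never stops result processing.
    breaker.call(lambda: publish_metric(processed))
    return processed
```

In this sketch the result is always returned whether or not the metric was published. Once the breaker opens, the failing dependency is not contacted again until the cool-down has elapsed.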

Posted Dec 01, 2020 - 19:26 EST

Resolved
This incident has been resolved.
Posted Nov 26, 2020 - 09:28 EST
Monitoring
Our team has deployed a hotfix and verified that results are now being processed.
Posted Nov 25, 2020 - 12:02 EST
Identified
Our team is working to deploy a hotfix to work around the AWS service outage, which will allow results processing to resume. We will provide additional details as they are available.
Posted Nov 25, 2020 - 10:43 EST
Update
We are continuing to investigate this issue.
Posted Nov 25, 2020 - 09:03 EST
Update
We are continuing to investigate this issue.
Posted Nov 25, 2020 - 09:02 EST
Investigating
Beginning at 8:41 am Eastern time, a fault was triggered in our results processing. This is causing check results to be delayed. This processing includes in-app reporting as well as all check-related notifications.

We have identified the cause as a service outage with AWS CloudWatch in the us-east-1 region. During this outage, checks will continue to be executed and results collected. Once service has been restored, all collected results will be processed and notifications will be delivered.
Posted Nov 25, 2020 - 08:59 EST
This incident affected: Rigor Monitoring - Notifications (Email Alerts, Phone Alerts, SMS Alerts, Webhooks).