At 8:41 AM ET on Wednesday, November 25, Rigor detected a problem with our service and alerted our operations team. The root cause was an outage of the Amazon Web Services (AWS) Kinesis service in the US-EAST-1 region (https://status.aws.amazon.com/). Kinesis powers a number of Amazon services, including CloudWatch, which Rigor relies on to monitor its services.
This caused a knock-on effect: our scheduler and results processing services were unable to log their results to CloudWatch, and without CloudWatch data our engineers had limited visibility into our system. At 11:39 AM, while the AWS issue was still ongoing, Rigor deployed a fix to work around it, and the team began bringing all services back to full availability. Unfortunately, this outage meant that runs between approximately 10:17 AM ET and 3:47 PM ET were not executed, and users may have seen a few duplicate runs scheduled as the system returned to normal.
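For readers curious about the failure mode described above, where a monitoring dependency stalls the core work, the following is a minimal Python sketch of a circuit breaker. It is an illustration under stated assumptions and not Rigor's actual implementation: the names CircuitBreaker, publish_metrics, and process_run, and the threshold values, are all hypothetical.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, fallback=None):
        # While the circuit is open, skip the dependency entirely
        # until the cool-down has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback
            # Cool-down elapsed: allow one trial call (half-open state).
            # The failure count is left as-is, so one more failure reopens.
            self.opened_at = None
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        self.failures = 0
        return result


def publish_metrics(result):
    # Hypothetical stand-in for a call to an external monitoring service
    # that is currently unavailable.
    raise ConnectionError("monitoring service unavailable")


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)


def process_run(result):
    # The monitoring call is wrapped, so its failure cannot block the run.
    breaker.call(publish_metrics, result, fallback=None)
    return f"processed {result}"
```

In this sketch, process_run still completes when publish_metrics fails, and after two consecutive failures the breaker stops attempting the monitoring call for 60 seconds instead of retrying it on every run.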
In the short term, the team is making the removal of CloudWatch as a dependency permanent. Longer term, we are investigating other points of critical dependency where we can add circuit breakers, to improve reliability and safeguard against future third-party outages.
Thank you for your patience and your trust in Rigor. We know this is an important week for many of you, and we are monitoring the situation very closely.