Check Scheduling Delays
Incident Report for Rigor

Splunk Synthetics recently experienced an issue in which checks ran at a lower frequency than expected, resulting in fewer runs and longer delays between runs. While our team deployed several immediate fixes to mitigate the issue, the root cause of the delays in our check queuing system was our database server reaching its maximum capacity.

When our Operations team attempted to scale up the database, we encountered previously unknown incompatibilities that prevented the team from scaling the database any further. We then had to undertake a much more complex maintenance of the database to migrate to a new platform.

Our team then worked as quickly as possible to build and test a migration plan, practicing it first on our staging platform and performing a dry run on production, before opening an emergency maintenance window to make the change in production.

The team confirmed the migration immediately resolved the issue.

It is worth noting that the migration was also an upgrade, which provides ample vertical and horizontal scalability options for the future.

In addition, throughout the incident we identified and corrected a number of operational gaps that hindered our ability to quickly identify and troubleshoot the problem, lengthening the time it took to detect and resolve the issue. Specifically, we improved our monitoring and operational visibility for the scheduling function and the queuing platform.

We would like to apologize for the impact that this incident had on our customers and their businesses. We take any incident that affects the availability and reliability of our customers' monitoring extremely seriously. We are continuing to conduct a thorough investigation of the incident, and making the changes that result from that investigation will be our top priority in Splunk Synthetics Engineering.

Posted Mar 12, 2021 - 14:05 EST

After monitoring the results of the database maintenance, we are confident that we have resolved the check queuing delay issue. We will be posting a post mortem of this incident next week.
Posted Mar 05, 2021 - 14:38 EST
Engineers are continuing to monitor the system. All checks continue to run on schedule and at the expected frequency, and everything looks fully operational. We will continue monitoring throughout the day and will post an update if anything changes.
Posted Mar 05, 2021 - 09:07 EST
The database upgrade has been completed successfully. All check types are now queueing at an optimal level, and the team is continuing to monitor. We'll continue to monitor the situation over the next few days, and will update this incident if there is any change in the situation.

As part of the upgrade, checks were paused from 21:20 EST until 23:11 EST, although there may have been sporadic runs during that period. You may see some duplicate runs right around 23:11, but as of the end of the upgrade, all checks should be running on their correct schedules, and at the correct frequency once again.
Posted Mar 05, 2021 - 00:05 EST
We are continuing to work on this issue and are seeing latency normalize.
Posted Mar 04, 2021 - 13:45 EST
We have identified additional improvements to continue addressing global check queueing delays. We plan to implement them tonight at 9:00 PM EST during a maintenance window, during which there may be a short outage when checks will not run. The changes we have implemented so far have reduced the delay, and we continue to make changes to eliminate it entirely and to plan for future capacity growth.
Posted Mar 04, 2021 - 11:19 EST
We have identified an issue affecting check run scheduling, and some checks are experiencing delays between runs across multiple locations. We currently have fixes in place to help minimize delays and are working towards implementing a longer term solution.
Posted Mar 02, 2021 - 19:01 EST
This incident affected: Rigor Monitoring (Application Website).