Check Scheduling Delays
Incident Report for Rigor

Splunk Synthetics recently experienced an issue in which checks ran at a lower frequency than expected, resulting in fewer runs and longer delays between runs. While our team deployed several immediate fixes to mitigate the issue, the root cause of the delays in our check queuing system was our database server reaching its maximum capacity.

When our Operations team attempted to scale up the database, we encountered previously unknown incompatibilities that prevented the team from scaling the database any further. We then had to undertake a much more complex maintenance of the database to migrate to a new platform.

Our team then worked as quickly as possible to build and test a migration plan, practicing it first on our staging platform and performing a dry run on production, before opening an emergency maintenance window to make the change in production.

The team confirmed the migration immediately resolved the issue.

It is worth noting that the migration was also an upgrade, which provides ample vertical and horizontal scalability options for the future.

In addition, throughout the incident we identified and corrected a number of operational gaps that hindered our ability to quickly identify and troubleshoot the problem, lengthening the time it took to detect and resolve the issue. Specifically, we improved our monitoring and operational visibility for the scheduling function and the queuing platform.

We would like to apologize for the impact that this incident had on our customers and their businesses. We take any incident that affects the availability and reliability of our customers' monitoring extremely seriously. We are continuing to conduct a thorough investigation of the incident, and making the changes that result from that investigation will be our top priority in Splunk Synthetics Engineering.

Posted Mar 12, 2021 - 14:05 EST

After monitoring the results of the database maintenance, we are confident that we have resolved the check queuing delay issue. We will be posting a post mortem of this incident next week.
Posted Mar 05, 2021 - 14:38 EST
Engineers are continuing to monitor the system. All checks continue to run on schedule and at the expected frequency, and everything looks fully operational. We will continue monitoring throughout the day and will post an update if anything changes.
Posted Mar 05, 2021 - 09:07 EST
The database upgrade has been completed successfully. All check types are now queueing at an optimal level, and the team is continuing to monitor. We'll continue to monitor the situation over the next few days, and will update this incident if there is any change in the situation.

As part of the upgrade, checks were paused from 21:20 EST until 23:11 EST, although there may have been sporadic runs during that period. You may see some duplicate runs right around 23:11, but as of the end of the upgrade, all checks should be running on their correct schedules, and at the correct frequency once again.
Posted Mar 05, 2021 - 00:05 EST
We are continuing to work on this issue and are seeing latency normalize.
Posted Mar 04, 2021 - 13:45 EST
We have identified additional improvements to continue addressing global check queueing delays. We plan to implement them tonight at 9:00 PM EST during a maintenance window, during which there may be a short outage when checks will not run. The changes we have implemented so far have reduced the delay, and we continue to make changes to eliminate it entirely and to plan for future capacity growth.
Posted Mar 04, 2021 - 11:19 EST
We have identified an issue affecting check run scheduling, and some checks are experiencing delays between runs across multiple locations. We currently have fixes in place to help minimize delays and are working towards implementing a longer term solution.
Posted Mar 02, 2021 - 19:01 EST
This incident affected: Rigor Monitoring (Application Website).