Managing Flaky Tests for High-Quality Product and Developer Velocity – Airtable’s Approach
Large enterprise customers rely on Airtable to power their businesses. To provide a reliable, high-quality product, Airtable’s engineering team writes tens of thousands of tests, including unit tests, integration tests, and end-to-end tests. However, Airtable’s large-scale monorepo poses some challenges, especially with regard to flaky tests. In this article, we’ll discuss Airtable’s approach to managing flaky tests to maintain both a high-quality product and high developer velocity.
Why Flaky Tests are a Challenge
Flaky tests are tests that can pass or fail without any code changes. They make it difficult for developers to trust the test results and slow down the development process. Airtable’s pragmatic approach acknowledges that end-to-end integration tests, including those that depend on timing, have value in exercising more of the actual code that will run in production. However, they can also be a source of flakiness.
Requirements for Continuous Integration (CI)
CI is crucial to developer velocity, but only if it meets two requirements. First, CI results need to be correct, because flaky test failures erode confidence in the test suite. Second, CI needs to be fast, because waiting on a long-running build breaks a developer’s flow.
Reducing Flakiness on the Infrastructure Level
Airtable has done several things to reduce test flakiness at the infrastructure level. Each test file runs in its own Docker container, so tests that run in parallel are isolated from each other. CI uses each test’s historical CPU utilization to decide how many CPU cores to allocate to its container, so that a test does not become flaky because of CPU throttling. Sizing containers this way also lets CI use the host machine effectively and finish builds faster by running tests in parallel.
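The CPU sizing idea can be sketched as follows. This is a minimal illustration, not Airtable's implementation: the function name `cores_for_test`, the use of a 95th percentile, the 20% headroom, and the core limits are all assumptions chosen for the example.

```python
import math
from typing import Sequence


def cores_for_test(
    cpu_samples: Sequence[float],  # past CPU usage of this test, in cores (assumed unit)
    headroom: float = 1.2,         # hypothetical safety margin
    min_cores: int = 1,            # hypothetical lower bound
    max_cores: int = 8,            # hypothetical upper bound
) -> int:
    """Pick a whole number of cores from a test's historical CPU usage."""
    if not cpu_samples:
        # No history yet: fall back to the minimum allocation.
        return min_cores
    ordered = sorted(cpu_samples)
    # Nearest-rank 95th percentile, so one unusual run does not dominate.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    p95 = ordered[rank - 1]
    # Add headroom, round up to whole cores, and clamp to the allowed range.
    return max(min_cores, min(max_cores, math.ceil(p95 * headroom)))


if __name__ == "__main__":
    print(cores_for_test([1.8, 2.1, 1.9, 2.0]))
```

A result like this could be applied as a container CPU limit, for example through Docker's `--cpus` option, although the article does not say how Airtable enforces the allocation.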
Basic Steps to Maintain a Trade-off
Tests will sometimes be flaky, so Airtable takes some basic steps to balance detecting regressions against developer velocity. CI retries failed tests for up to 3 attempts (subject to an overall timeout) to limit the impact of flaky tests on developers. Tests that cross a flakiness threshold are quarantined, so that known flaky tests do not affect other developers’ builds while the test owner works on a fix.
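A retry policy of this kind might look like the sketch below. It reads "up to 3 attempts" as three runs in total, which is an interpretation. The name `run_with_retries`, the 600-second budget, and the shape of the returned record are hypothetical.

```python
import time
from typing import Callable, Dict


def run_with_retries(
    test_fn: Callable[[], bool],       # returns True when the test passes
    max_attempts: int = 3,             # total attempts, per the policy above
    overall_timeout_s: float = 600.0,  # hypothetical overall time budget
) -> Dict[str, object]:
    """Run a test until it passes, attempts run out, or the budget is spent."""
    deadline = time.monotonic() + overall_timeout_s
    attempts = 0
    passed = False
    # A new attempt starts only while both the attempt and time budgets remain.
    while attempts < max_attempts and time.monotonic() < deadline:
        attempts += 1
        if test_fn():
            passed = True
            break
    # "retried" is the signal worth reporting: a pass after a retry hints at flakiness.
    return {"passed": passed, "attempts": attempts, "retried": attempts > 1}


if __name__ == "__main__":
    outcomes = iter([False, True])  # fails once, then passes
    print(run_with_retries(lambda: next(outcomes)))
```

Recording whether a retry happened, and not just the final result, is what lets a later step tell a reliably passing test from one that only passes on a second try.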
Monitoring Test Reliability
To quarantine flaky tests, Airtable first needs a reliable way to detect which tests fail intermittently. For this, Airtable set up a continuous build pipeline off the main branch. The pipeline runs the full test suite and reports every test retry and failure to an analytics platform. Daily reports highlight the flakiest tests over a recent period, and the team then quarantines those tests by submitting code changes and merging them into the main branch.
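The daily report can be thought of as a ranking by flake rate. The sketch below assumes each reported run is reduced to a pair of test id and a flag for whether it was retried or failed. That record shape, the name `flakiest_tests`, and the `min_runs` cutoff are assumptions for illustration, not details from Airtable's analytics platform.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def flakiest_tests(
    runs: Iterable[Tuple[str, bool]],  # (test_id, was_retried_or_failed)
    top_n: int = 10,
    min_runs: int = 5,                 # hypothetical cutoff to ignore tiny samples
) -> List[Tuple[str, float]]:
    """Rank tests by the fraction of runs that were retried or failed."""
    totals: Counter = Counter()
    flaky: Counter = Counter()
    for test_id, was_flaky in runs:
        totals[test_id] += 1
        if was_flaky:
            flaky[test_id] += 1
    rates = [
        (test_id, flaky[test_id] / totals[test_id])
        for test_id in totals
        if totals[test_id] >= min_runs and flaky[test_id] > 0
    ]
    # Highest flake rate first; ties are broken by test id for stable output.
    rates.sort(key=lambda item: (-item[1], item[0]))
    return rates[:top_n]


if __name__ == "__main__":
    sample = [("a", True)] * 2 + [("a", False)] * 8 + [("b", True)] * 5 + [("b", False)] * 5
    print(flakiest_tests(sample))
```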
Automating the Process
Once Airtable had confidence in its quarantine policies, it automated the process of adding tests to and removing them from the quarantine list. Whenever a test fails on the main branch build, the automation system triggers a stress test for that test case. If the stress test produces too many failures, the automation system adds the test case to the quarantine list. CI loads the latest quarantine list at startup and suppresses a failure if the test case is on the list. The automation system keeps monitoring the reliability of quarantined tests in the main branch build, and removes a test case from the quarantine list once enough builds have passed without that test failing.
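The add and remove rules described above can be sketched as a small state machine. Everything specific here is an assumption: the class `QuarantineManager`, its method names, and the three thresholds are hypothetical, since the article does not give the real values or interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

STRESS_RUNS = 20              # hypothetical number of stress-test runs
MAX_STRESS_FAILURES = 2       # hypothetical: more failures than this means quarantine
CLEAN_BUILDS_TO_RELEASE = 50  # hypothetical: clean main builds needed for removal


@dataclass
class QuarantineManager:
    quarantined: Set[str] = field(default_factory=set)
    clean_streak: Dict[str, int] = field(default_factory=dict)

    def on_main_failure(self, test_id: str, run_once: Callable[[], bool]) -> bool:
        """Stress-test a test that failed on main; return True if newly quarantined."""
        if test_id in self.quarantined:
            return False
        failures = sum(1 for _ in range(STRESS_RUNS) if not run_once())
        if failures > MAX_STRESS_FAILURES:
            self.quarantined.add(test_id)
            self.clean_streak[test_id] = 0
            return True
        return False

    def on_main_build_finished(self, failed_tests: Set[str]) -> None:
        """Update clean streaks and release tests that have been stable long enough."""
        for test_id in list(self.quarantined):
            if test_id in failed_tests:
                self.clean_streak[test_id] = 0
                continue
            self.clean_streak[test_id] += 1
            if self.clean_streak[test_id] >= CLEAN_BUILDS_TO_RELEASE:
                self.quarantined.discard(test_id)
                del self.clean_streak[test_id]

    def should_suppress(self, test_id: str) -> bool:
        """CI consults this when deciding whether a failure should fail the build."""
        return test_id in self.quarantined
```

Keeping the release rule based on consecutive clean builds means a single new failure restarts the count, which is one conservative way to read "sufficient numbers of builds without that test failing".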
Fixing Flaky Tests
It is not enough to just quarantine a flaky test. The owner also needs to fix it so that test coverage stays high. Once developers are notified about a flaky test failure, they are very responsive in finding the root cause and fixing the issue. Airtable uses its own product’s ability to share data flexibly across the organization: the quarantine list is synced to an Airtable base, and Interfaces built on top of it let teams easily visualize the quarantine list and how it has changed over time. Automations periodically publish a digest of quarantined tests to the Slack channel of each team that owns a test on the quarantine list.
Conclusion
Airtable’s Developer Effectiveness and Quality Engineering Teams have developed an effective approach to managing flaky tests that balances the need for a high-quality product and high developer velocity. By leveraging continuous build pipelines, stress tests, and automations, flaky tests can be quarantined and fixed efficiently, ensuring that Airtable maintains the trust of its customers and the pace of its development.