How To Manage Flaky Tests
Li Haoyi, 1 January 2025
Many projects suffer from the problem of flaky tests: tests that pass or fail non-deterministically. These cause confusion, slow development cycles, and endless arguments between individuals and teams in an organization.
This article dives deep into working with flaky tests, from the perspective of someone who built the first flaky test management systems at Dropbox and Databricks and maintained the related build and CI workflows over the past decade. The issue of flaky tests can be surprisingly unintuitive, with many "obvious" approaches being ineffective or counterproductive. But it turns out there are right and wrong answers to many of these issues, and we will discuss both so you can better understand what managing flaky tests is all about.
What Causes Flaky Tests?
A flaky test is a test that sometimes passes and sometimes fails, non-deterministically.
Flaky tests can be caused by a whole range of issues:
Race conditions within tests
Often this manifests as sleep/time.sleep/Thread.sleep calls in your test expecting some concurrent code path to complete, which may or may not wait long enough depending on how much CPU contention there is slowing down your test code. But any multi-threaded or multi-process code has the potential for race conditions or concurrency bugs, and these days most systems make use of multiple cores.
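As a concrete illustration, here is a minimal, self-contained Scala sketch contrasting a fixed sleep with polling for the condition under a timeout. The eventually helper is hypothetical, not taken from any particular test framework:

```scala
// Hypothetical sketch: a fixed Thread.sleep races against background work,
// while a polling helper waits only as long as needed, up to a timeout.
import java.util.concurrent.atomic.AtomicBoolean

object FlakyVsPolling {
  // Poll `condition` every 50ms until it is true or `timeoutMs` elapses.
  def eventually(timeoutMs: Long = 10000)(condition: => Boolean): Unit = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (!condition) {
      if (System.currentTimeMillis() > deadline)
        throw new AssertionError("condition not met within timeout")
      Thread.sleep(50)
    }
  }

  def main(args: Array[String]): Unit = {
    val done = new AtomicBoolean(false)
    // Simulated background work that takes roughly half a second to finish
    new Thread(() => { Thread.sleep(500); done.set(true) }).start()

    // Flaky version: only passes if the background work happens to finish within 100ms
    //   Thread.sleep(100); assert(done.get())

    // More robust version: wait for the condition itself, bounded by a timeout
    eventually() { done.get() }
    println("background work completed")
  }
}
```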
Race conditions between tests
Two tests both reading and writing the same global variables or files (e.g. in ~/.cache) can cause non-deterministic outcomes. These can be tricky, because every test may pass when run alone, and the test that fails when run in parallel may not be the one that is misbehaving!
Test ordering dependencies
Even in the absence of parallelism, tests may interfere with one another by mutating global variables or files on disk. Depending on the exact tests you run or the order in which you run them, the tests can behave differently and pass or fail unpredictably. This is perhaps not strictly non-deterministic - the same tests run in the same order will behave the same - but it is practically non-deterministic, since different CI runs may run tests in different orders. Selective Testing can cause this kind of issue, as can dynamic load-balancing of tests between parallel workers to minimize total wall-clock time (which Dropbox’s Changes CI system did).
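A minimal sketch of this kind of shared-state interference; the test names and cache path here are made up for illustration:

```scala
// Hypothetical sketch: two "tests" sharing one cache file under the user's home
// directory. Each passes when run alone, but running them in the other order,
// or in parallel, can make one of them fail through no fault of its own.
import java.nio.file.{Files, Paths}

object SharedStateTests {
  val cache = Paths.get(sys.props("user.home"), ".cache", "example-token")

  // Passes on its own: writes the cache file and reads it back
  def testWritesDefaultToken(): Unit = {
    Files.createDirectories(cache.getParent)
    Files.write(cache, "default".getBytes)
    assert(new String(Files.readAllBytes(cache)) == "default")
  }

  // Also passes on its own, but fails if the test above has already run
  // (or is running concurrently), because both share the same file on disk
  def testAssumesNoCache(): Unit = {
    assert(!Files.exists(cache), "unexpected leftover cache file from another test")
  }
}
```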
Resource contention
Depending on exactly how your tests are run, they may rely on process memory, disk space, file descriptors, ports, or other limited physical resources. These are subject to noisy neighbour problems: e.g. an overloaded Linux system with too many memory-hungry processes will start OOM-killing processes seemingly at random.
External flakiness
Integration or end-to-end tests often interact with third-party services, and it is not uncommon for the service to be flaky, rate limit you, have transient networking errors, or just be entirely down for some period of time. Even fundamental workflows like "downloading packages from a package repository" can be subject to flaky failures.
Sometimes the flakiness is specific to the test; sometimes it is flakiness in the code being tested, which may also surface as real unreliability when customers are trying to use your software. Both scenarios look the same to developers: a test that passes and fails non-deterministically when run locally or in CI.
Why Are Flaky Tests Problematic?
Development Slowdown
Flaky tests generally make it impossible to know the state of your test suite, which in turn makes it impossible to know whether the software you are working on is broken or not, which is the reason you wanted tests in the first place. Even a small number of flaky tests is enough to destroy the core value of your test suite.
- Ideally, if the tests pass on a proposed code change, even someone unfamiliar with the codebase can be confident that the code change did not break anything.
- Once test failures start happening spuriously, it quickly becomes impossible to get a fully "green" test run without failures.
- So in order to validate a code change (merging a pull-request, deploying a service, etc.) the developer then needs to individually triage those test failures and judge whether they are real issues or spurious.
This means that you are back to the "manual" workflow of developers squinting at test failures to decide if they are relevant or not. This risks both false positives and false negatives:
- A developer may waste time triaging a failure that turns out to be a flake, unrelated to the change being tested.
- A developer may wrongly judge that a test failure is spurious, only for it to be a real failure that causes breakage for customers once released.
Even relatively low rates of flakiness can cause issues. For example, consider the following:
- A codebase with 10,000 tests
- 1% of the tests are flaky
- Each flaky test case fails spuriously 1% of the time
Although 1% of tests each failing 1% of the time may not seem like a huge deal, it means that someone running the entire test suite only has a 0.99^100 = ~37% chance of getting a green test report! The other 63% of the time, someone running the test suite without any real breakages gets one or more spurious failures, which they then have to spend time and energy triaging and investigating. If the developer needs to retry the test runs to get a successful result, they would need to retry on average 1 / 0.37 = 2.7 times: in this scenario the retries alone may be enough to increase your testing latencies and infrastructure costs by 170%, on top of the manual work needed to triage and investigate the test failures!
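For concreteness, here is the arithmetic above as a small self-contained snippet, using the same illustrative numbers:

```scala
// Reproducing the arithmetic above: 10,000 tests, 1% of them flaky, each flaking 1% of the time
object FlakyMath {
  def main(args: Array[String]): Unit = {
    val flakyTests = 10000 * 0.01             // 100 flaky tests run in every full suite run
    val pGreen = math.pow(0.99, flakyTests)   // ~0.37: chance of an all-green run with no real breakage
    val expectedRuns = 1 / pGreen             // ~2.7 runs on average to get a green result
    println(f"P(green run) = $pGreen%.2f, expected runs until green = $expectedRuns%.1f")
  }
}
```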
Inter-team Conflict
One fundamental issue with flaky tests is organizational:
- The team that owns a test benefits from the test running and providing coverage for the code they care about
- Other teams that run the test in CI suffer from the spurious failures and wasted time that hitting flaky tests entails
Selective Testing can help mitigate this to some extent by letting you avoid running unrelated tests, but it doesn’t make the problem fully disappear. For example, a downstream service may be triggered every time an upstream utility library is changed, and if the tests are flaky and the service and library are owned by different teams, you end up with the conflict described above.
What ends up happening is that nobody prioritizes fixing their flaky tests, because that is the selfishly-optimal thing to do, but as a result everyone suffers from everyone else’s flaky tests, even though everyone would be better off if all flaky tests were fixed. This is a classic Tragedy of the Commons, and as long as flaky tests are allowed to exist, this will result in endless debates or arguments between teams about who needs to fix their flaky tests, wasting enormous amounts of time and energy.
Mitigating Flaky Tests
In general it is impossible to completely avoid flaky tests, but you can take steps to mitigate them:
- Avoid race conditions in your application code to prevent random crashes or behavioral changes affecting users, and avoid race conditions in your test code
- Run parallel test processes inside "sandbox" empty temp folders, to try and avoid them reading and writing the same files on the filesystem and risking race conditions (See Mill Sandboxing, and the sketch after this list)
- Run test processes inside CGroups to mitigate resource contention: e.g. if every test process is limited in how much memory it can use, it cannot cause memory pressure that might cause other tests to be OOM-killed (See Bazel’s Extended CGroup Support, which we implemented in Databricks' Runbot CI system)
- Mock out external services: e.g. AWS and Azure can be mocked using LocalStack, parts of Azure Kubernetes can be mocked using KIND, etc.
- Selective Testing, e.g. via Mill’s Selective Test Execution, reduces the number of tests you run and thus the impact of flakiness
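To make the sandboxing idea concrete, here is a minimal sketch (not Mill's actual implementation): each test body is handed a freshly created empty temp folder and does all of its file I/O inside it.

```scala
import java.nio.file.{Files, Path}
import java.util.Comparator

object Sandbox {
  // Run `body` against a freshly-created empty temp folder, then clean it up afterwards
  def inSandbox[T](testName: String)(body: Path => T): T = {
    val dir = Files.createTempDirectory(s"test-sandbox-$testName-")
    try body(dir)
    finally {
      // Best-effort cleanup: delete the sandbox contents, deepest paths first
      Files.walk(dir).sorted(Comparator.reverseOrder[Path]()).forEach(p => Files.delete(p))
    }
  }

  def main(args: Array[String]): Unit = {
    inSandbox("writes-scratch-file") { dir =>
      // All file I/O in the test stays inside its own sandbox directory
      val scratch = dir.resolve("scratch.txt")
      Files.write(scratch, "hello".getBytes)
      assert(Files.exists(scratch))
    }
  }
}
```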
However, although you can mitigate the flakiness, you should not expect to make it go away entirely.
- Race conditions will find their way into your code despite your best efforts, and you will need some hardcoded timeouts to prevent your test suite hanging forever.
- There will always be some limited physical resource you didn’t realize could run out, until it does.
- Mocking out third-party services never ends up working 100%: inevitably you hit cases where the mock isn’t accurate enough, or trustworthy enough, and you still need to test against the real service to get confidence in the correctness of your system.
End-to-end tests and integration tests are especially prone to flakiness, as are UI tests exercising web or mobile app UIs.
As a developer, you should work hard to make your application and test code as deterministic as possible. You should have a properly-shaped Test Pyramid, with more small unit tests that tend to be stable and fewer integration/end-to-end/UI tests that tend to be flaky. But you should also accept that despite your best efforts, flaky tests will appear, and so you will need some plan or strategy to deal with them when they do.
How Not To Manage Flaky Tests
Flaky test management can be surprisingly counter-intuitive. Below we discuss some common mistakes people make when they first start dealing with flaky tests.
Do Not Block Code Changes on Flaky Tests
The most important thing to take note of is that you should not block code changes on flaky tests: merging pull-requests, deploying services, etc.
That is despite blocking code changes being the default and most obvious behavior: e.g. if you wait for a fully-green test run before merging a code change, and a flaky test makes the test run red, then it blocks the merge. However, this is not a good workflow for a variety of reasons:
- A flaky failure when testing a code change does not indicate the code change caused that breakage. So blocking the merge on the flaky failure just prevents progress without actually helping increase system quality.
- The flaky test may be in a part of the system totally unrelated to the code change being tested, which means the individual working on the code change has zero context on why it might be flaky, and unexpectedly context-switching to deal with the flaky test is mentally costly.
- Blocking progress on a flaky test introduces an incentives problem: the code/test owner benefits from the flaky test’s existence, but other people working in that codebase get blocked with no benefit. This directly leads to the endless Inter-team Conflict mentioned earlier.
Although "all tests should pass before merging" is a common requirement, it is ultimately unhelpful when you are dealing with flaky tests.
Preventing Flaky Tests From Being Introduced Is Hard
It can be tempting to "Shift Left" your flaky test management, and try to catch flaky tests before they land in your codebase. But doing so ends up being surprisingly difficult.
Consider the example we used earlier: 10,000 tests, with 1% of them flaky, each failing 1% of the time. These are arbitrary numbers, but pretty representative of what you will likely find in the wild.
- If someone adds a new test case, in order to have 95% confidence that it is not flaky, you would need to run it about 300 times (log(0.05) / log(0.99)).
- Even if we do run every new test 300 times, that 1 in 20 flaky tests will still slip through, and over time they will still build up into a population of flaky tests actively causing flakiness in your test suite.
- Furthermore, many tests are not flaky alone! Running the same test 300 times in isolation may not demonstrate any flakiness, since e.g. the test may only be flaky when run in parallel with another test due to Race conditions between tests or Resource contention, or in a specific order after other tests due to Test ordering dependencies.
- Lastly, it is not only new tests that are flaky! When I was working on this area at Dropbox and Databricks, the majority of flaky tests we detected were existing tests that were stable for days/weeks/months before turning flaky (presumably due to a code change in the application code or test code). Blocking new tests that are flaky does nothing to prevent the code changes causing old tests to become flaky!
To block code changes that cause either new or old tests to become flaky, we would need to run every single test about 300 times on each pull request, to give us 95% confidence that each 1% flaky test introduced by the code change would get caught. This is prohibitively slow and expensive, turning a test suite that takes 5 minutes and $1 to run into one that takes 25 hours and $300.
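The "about 300 times" figure comes from the calculation below; runsNeeded is a throwaway helper shown purely for illustration:

```scala
// How many runs are needed to see at least one failure from a flaky test, with a given confidence?
object RunsForConfidence {
  def runsNeeded(failRate: Double, confidence: Double): Int =
    math.ceil(math.log(1 - confidence) / math.log(1 - failRate)).toInt

  def main(args: Array[String]): Unit = {
    // A test that flakes 1% of the time needs ~299 runs for a 95% chance of catching it
    println(runsNeeded(failRate = 0.01, confidence = 0.95))
  }
}
```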
In general, it is very hard to block flaky tests "up front". You have to accept that over time some parts of your test suite will become flaky, and then make plans on how to respond and manage those flaky tests when they inevitably appear.
Managing Flaky Tests
Once flaky tests start appearing in your test suite, you need to do something about them. This generally involves (a) noticing that flaky tests exist, (b) identifying which tests are flaky, and (c) mitigating those specific problematic tests to prevent them from causing pain to your developers.
Monitor Flaky Tests Asynchronously
As mentioned earlier, Preventing Flaky Tests From Being Introduced Is Hard. Thus, you must assume that flaky tests will make their way into your test suite, and monitor the flakiness when it occurs. This can be done in a variety of ways, for example:
- Most CI systems allow manual retries, and developers usually retry tests they suspect are flaky. If a test fails once then passes when retried on the same version of the code, it was a flaky failure. This is the metric we used in Databricks' CI system to monitor the flaky test numbers.
- Some CI systems or test frameworks have automatic retries: e.g. in Dropbox’s Changes CI system all tests were retried twice by default. If a test fails initially and then passes on the retry, it is flaky: the fact that it’s non-deterministic means that next time, it might fail initially and then fail on the retry too!
- Most CI systems run tests to validate code changes before merging, and then run tests again to validate the code post-merge. Post-merge runs should "always" be green, but sometimes suffer breakages or flakiness. If a test passes, fails, then passes on three consecutive post-merge commit runs, it is likely to be flaky: real breakages tend to cause a string of consecutive test failures before being fixed or reverted, and very rarely get noticed and dealt with immediately.
Notably, most test failures when validating code changes (e.g. on pull requests) are not useful here: tests are meant to break when validating code changes in order to catch problems! Hence the need for the slightly-roundabout ways above to determine what tests are flaky, by looking for failures at times when you wouldn’t expect failures to occur.
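As a rough illustration of the first two signals, here is a hypothetical sketch that flags any test which both failed and passed against the same commit (i.e. it passed on a retry of identical code). The data model is made up for illustration; in a real system this would be a query over your CI results database:

```scala
// Hypothetical sketch of flakiness detection from stored CI results
case class TestResult(testName: String, commit: String, passed: Boolean)

object FlakeDetector {
  // A test that both failed and passed on the same commit is flagged as flaky
  def flakyTests(history: Seq[TestResult]): Set[String] =
    history
      .groupBy(r => (r.testName, r.commit))
      .collect { case ((name, _), results) if results.exists(_.passed) && results.exists(!_.passed) => name }
      .toSet

  def main(args: Array[String]): Unit = {
    val history = Seq(
      TestResult("FooTest", "abc123", passed = false),
      TestResult("FooTest", "abc123", passed = true),  // failed then passed on retry: flaky
      TestResult("BarTest", "abc123", passed = false),
      TestResult("BarTest", "def456", passed = false)  // consistently failing: not flagged
    )
    println(flakyTests(history)) // Set(FooTest)
  }
}
```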
Once you have noticed a test is flaky, there are two main options: retries and quarantine.
Retrying Flaky Tests
Retries are always controversial. A common criticism is that they can mask real flakiness in the system that can cause real problems to customers, which is true. However, we already discussed why we should not block code changes on flaky tests, since doing so just causes pain while not being an effective way of getting the flakiness fixed.
Furthermore, developers are going to be manually retrying flaky tests anyway: whether by restarting the job validating their pull request, or running the test manually on their laptop or devbox to check if it’s truly broken. Thus, we should feel free to add automatic retries around flaky tests to automate that tedious manual process.
Retrying flaky tests can be surprisingly effective. As mentioned earlier, even infrequently flaky tests can cause issues, with a small subset of tests flaking 1% of the time being enough to block all progress. However, one retry turns it into a 0.01% flaky test, and two retries turns it into a 0.0001% flaky test. So even one or two retries is enough to make most flaky tests stable enough to not cause issues.
Retrying flaky tests has two weaknesses:
Retries can be expensive for real failures
If you retry a test twice, that means that an actually-failed test would run three times before giving up. If you retry every test by default, and a code change breaks a large number of them, running all those failing tests three times can be a significant performance and latency penalty.
To mitigate this, you should generally avoid "blanket" retries, and only add retries around specific tests that you have detected as being flaky.
Retries may not work if not coarse grained enough
For example, if test_a fails due to interference with test_b running concurrently, retrying test_a immediately while test_b is still running will fail again. Or if the flakiness is due to some bad state on the filesystem, the test may continue flaking until it is run on a completely new machine with a clean filesystem.
This failure mode can be mitigated by retrying the failed tests only after the entire test suite has completed, possibly on a clean test machine.
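Sketched out, end-of-suite retries might look something like the following, where runTest is a stand-in for however your framework actually invokes a single test and reports pass/fail:

```scala
// Run everything once, then re-run only the failures after the rest of the suite has finished,
// so a test that interferes with its neighbour is no longer running alongside it.
object EndOfSuiteRetry {
  // Returns the tests that failed both the initial run and the end-of-suite retry
  def runSuite(tests: Seq[String], runTest: String => Boolean): Seq[String] = {
    val firstPassFailures = tests.filterNot(runTest) // initial full run of every test
    firstPassFailures.filterNot(runTest)             // re-run only the failures once the suite is idle
  }
}
```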
Auto-Quarantining Flaky Tests
Quarantine involves detecting that a test is flaky, and simply not counting it when deciding whether or not to accept a code change for merge or deployment.
This is much more aggressive than retrying flaky tests, as even real breakages will get ignored for quarantined tests. You effectively lose the test coverage given by a particular test for the period while it is quarantined. Only when someone eventually fixes the flaky test can it be removed from quarantine and can begin running and blocking code changes again.
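Mechanically, the gate can be as simple as the sketch below, which assumes a hypothetical in-repo quarantined-tests.txt file with one test name per line: failures in quarantined tests are reported but do not block the change.

```scala
import scala.io.Source

object QuarantineGate {
  // Only failures in non-quarantined tests count against the code change
  def blockingFailures(failedTests: Seq[String], quarantineFile: String): Seq[String] = {
    val quarantined = Source.fromFile(quarantineFile).getLines().map(_.trim).filter(_.nonEmpty).toSet
    failedTests.filterNot(quarantined)
  }

  def main(args: Array[String]): Unit = {
    // Suppose quarantined-tests.txt (a hypothetical file) contains "FlakySearchTest"
    val blocking = blockingFailures(Seq("FlakySearchTest", "PaymentsTest"), "quarantined-tests.txt")
    if (blocking.nonEmpty) { println(s"Blocking failures: $blocking"); sys.exit(1) }
    else println("Only quarantined tests failed; the change can proceed")
  }
}
```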
Quarantining is best automated, both to remove busy-work of finding/quarantining flaky tests, and to avoid the inevitable back-and-forth between the people quarantining the tests and the people whose tests are getting quarantined.
Why Quarantine?
The obvious value of quarantining flaky tests is that it unblocks merging of code changes by ignoring flaky tests that are probably not relevant. Quarantine basically automates what people do manually in the presence of flaky tests anyway:
- When enough tests are flaky, eventually developers are going to start merging/deploying code changes despite the failures being present, because getting a "fully green" test run is impossible
- When that happens, the developer is not going to be able to tell whether a failure is flaky or real, so if a code change causes a real breakage in that test the developer is likely not going to notice and will merge/deploy it anyway!
So although naively it seems like quarantining flaky tests costs you test coverage, in reality it costs you nothing and simply automates the loss of coverage that you are going to suffer anyway. It saves a lot of manual effort by having the quarantine system remember the flaky tests and ignore them on your behalf, rather than your developers manually deciding which test failures to ignore based on which tests they remember being flaky.
Why Quarantine? Part 2
The non-obvious value of quarantining flaky tests is that it aligns incentives across a development team or organization:
- Normally, a flaky test means the test owner continues to benefit from the test coverage while other teams suffer from the flakiness
- With auto-quarantine, a flaky test means the test owner both benefits from the coverage of their healthy tests and suffers the lack of coverage caused by their flaky test being quarantined
This aligning of incentives means that with auto-quarantine enabled, the normal endless discussions and disputes about flaky tests tend to disappear. The test owner can decide themselves how urgently they need to fix a quarantined flaky test, depending on how crucial that test coverage is, or even whether they should fix it at all! Other teams are not affected by the quarantined flaky test, and do not care what the test owner ends up deciding.
Most commonly, quarantining is automatic, while un-quarantining a test can be automatic or manual. Due to the non-deterministic nature of flakiness, it’s often hard to determine whether a flaky test has been truly fixed or not, but it turns out it doesn’t matter. If you try to fix a test, take it out of quarantine, and it turns out to be still flaky, the auto-quarantine system will just put it back into quarantine for you to take another look at it.
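An auto-quarantine policy does not need to be sophisticated. The sketch below (the FlakeEvent record, thresholds, and field names are all made up for illustration) quarantines a test once it has flaked a few times within a recent window:

```scala
// Hypothetical auto-quarantine rule: quarantine a test once it has flaked
// at least `threshold` times in the last `windowDays` days
case class FlakeEvent(testName: String, epochDay: Long)

object AutoQuarantine {
  def shouldQuarantine(flakeHistory: Seq[FlakeEvent], testName: String, today: Long,
                       windowDays: Int = 7, threshold: Int = 3): Boolean =
    flakeHistory.count(e => e.testName == testName && today - e.epochDay <= windowDays) >= threshold

  def main(args: Array[String]): Unit = {
    val today = java.time.LocalDate.now.toEpochDay
    val history = Seq(FlakeEvent("FooTest", today - 1), FlakeEvent("FooTest", today - 2), FlakeEvent("FooTest", today - 3))
    println(shouldQuarantine(history, "FooTest", today)) // true: three flakes within the last week
  }
}
```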
Implementing Flaky Test Management Systems
So far, all the discussion in this article has been at a high level. Exactly how to implement it is left as an exercise to the reader, but is usually a mix of:
- retry{} helpers in multiple languages you can sprinkle through your test code where necessary (a sketch is shown after this list)
- A SQL database storing historical test results and retries
- A SQL database or a text file committed in-repo to track quarantined tests
- A service that looks at historical test results and retries and decides when/if to quarantine a test
- Tweaks to your existing CI system to be able to work with all of the above: ignoring quarantined tests, tracking retry counts, tracking test results, etc.
- Some kind of web interface giving you visibility into all the components and workflows above, so when things inevitably go wrong you are able to figure out what’s misbehaving
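For example, a retry{} helper can be as small as the sketch below; this is one possible shape, not any specific library's API, and the httpClient call in the usage comment is a placeholder:

```scala
object Retry {
  // Re-run `block` up to `attempts` times, sleeping briefly between attempts.
  // In real code you would likely retry only specific exception types rather than all Throwables.
  def retry[T](attempts: Int = 3, delayMs: Long = 1000)(block: => T): T =
    try block
    catch {
      case _: Throwable if attempts > 1 =>
        Thread.sleep(delayMs)
        retry(attempts - 1, delayMs)(block)
    }
}

// Usage inside a test, wrapping only the step known to be flaky:
//   Retry.retry() { httpClient.fetch("https://third-party.example.com/endpoint") }
```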
Usually flaky test management starts off as an entirely manual process, which works fine for small projects. But as the size of the project grows, you inevitably need to augment the manual work with some basic automation, and over time build out a fully automated system to do what you want. So far I have not seen a popular out-of-the-box solution for this, and in my interviews with ~30 Silicon Valley companies it seems everyone ends up building their own. The Dropbox CI System and Databricks CI System I worked on both had their flaky test management bespoke and built into the infrastructure.
None of the techniques discussed in this article are rocket science, and the challenge is mostly just plumbing the necessary data back and forth between different parts of your CI system. But hopefully this high-level discussion of how to manage flaky tests should give you a head start, and save you the weeks or months it would take to learn the same things that I have learned working on flaky tests over the past decade.