Faster CI with Selective Testing
Li Haoyi, 24 December 2024
Selective testing is a key technique necessary for working with any large codebase or monorepo: picking which tests to run to validate a change or pull-request, because running every test every time is costly and slow. This blog post explores what selective testing is all about and the different approaches you can take, based on my experience working on developer tooling and CI for the last decade at Dropbox and Databricks. Lastly, we will discuss how selective testing is supported by the Mill build tool.
Large Codebases
Although a codebase can be large, you are typically only working on a fraction of it at any point in time. As an example that we will use throughout this article, consider a large codebase or monorepo that contains the code for:
- 3 backend services: `backend_service_1`, `backend_service_2`, `backend_service_3`
- 2 frontend web codebases: `frontend_web_1`, `frontend_web_2`
- 1 `backend_service_utils` module, and 1 `frontend_web_utils` module
- 3 deployments: `deployment_1`, `deployment_2`, `deployment_3`, which make use of the backend services and frontend web codebases
- The three deployments may be combined into a `staging_environment`, which is then used in three sets of `end_to_end_test`s, one for each deployment
These modules may depend on each other as shown in the diagram below, with the various `backend_service` and `frontend_web` codebases grouped into deployments, and the `_utils` modules shared between them:
These various modules would typically each have their own test suite, which for simplicity I left out of the diagrams.
Although the codebase described above is just an example, it reflects the kind of codebase structure that exists in many real-world codebases. Using this example, let us now consider a few ways in which selective testing can be used.
No Selective Testing
Most software projects start off naively running every test on every pull-request to validate the changes. For small projects this is fine. Every project starts small and most projects stay small, so this approach is not a problem at all.
However, for the projects that do continue growing, this strategy quickly becomes infeasible:
- The size of the codebase (in number of modules, lines of code, or number of tests) grows linearly, O(n), over time
- The number of tests necessary to validate any pull-request also grows linearly, O(n)
This means that although the test runs may start off running quickly, they naturally slow down over time:
- A week-old project may have a test suite that runs in seconds
- By a few months, the test suite may start taking minutes to run
- After a few years, the tests may be taking over an hour
- And there is no upper bound: in a large growing project, test runs can easily take several hours or even days to run
Although at a small scale waiting seconds or minutes for tests to run is not a problem, the fact that it grows unbounded means it will eventually become a problem on any growing codebase. Large codebases with test suites taking 4-8 hours to run are not uncommon at all, and this can become a real bottleneck in how fast developers can implement features or otherwise improve the functionality of the codebase.
In general, "no selective testing" works fine for small projects with 1-10 developers, but beyond that the inefficiency of running tests not relevant to your changes starts noticeably slowing down day to day development. At that point, the first thing people reach for is some kind of folder-based selective testing.
Folder-based Selective Testing
Typically, in any large codebase, most work happens within a single part of it: e.g. one developer may be working on `backend_service_1`, another may be working on `frontend_web_2`.
The obvious thing to do would be to make sure each module is in its own folder, e.g.
```
my_repo/
    backend_service_1/
    backend_service_2/
    backend_service_3/
    frontend_web_1/
    frontend_web_2/
    backend_service_utils/
    frontend_web_utils/
    deployment_1/
    deployment_2/
    deployment_3/
    ...
```
Doing simple folder-based selective execution generally involves the following (a minimal code sketch follows the examples below):
- Run a `git diff` on the code change you want to validate
- For any folders which contain changed files, run their tests
For example,

- A PR changing `backend_service_1/` will need to run the `backend_service_1/` tests
- A PR changing `frontend_web_2/` will need to run the `frontend_web_2/` tests
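Below is a minimal sketch of what this could look like as a standalone Python script. The folder list and the test command are assumptions made for this example codebase, not part of any real tool; in a real repo you would plug in your own folders and test runner.

```python
# Hypothetical sketch of folder-based test selection for the example repo above.
import subprocess

MODULE_FOLDERS = [
    "backend_service_1", "backend_service_2", "backend_service_3",
    "frontend_web_1", "frontend_web_2",
    "backend_service_utils", "frontend_web_utils",
    "deployment_1", "deployment_2", "deployment_3",
]

def changed_files(base: str = "origin/main") -> list[str]:
    """Files touched by the change under validation, relative to the target branch."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

def folders_to_test(changed: list[str]) -> set[str]:
    """Any folder containing a changed file gets its tests run."""
    return {
        folder for folder in MODULE_FOLDERS
        if any(path.startswith(folder + "/") for path in changed)
    }

if __name__ == "__main__":
    for folder in sorted(folders_to_test(changed_files())):
        print("running tests for", folder)
        # e.g. subprocess.run(["./run_tests.sh", folder], check=True)
```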
This does limit the number of tests any individual needs to execute: someone working on `backend_service_1` does not need to run the tests for `backend_service_2`, `backend_service_3`, etc. This helps keep test times from growing unbounded as the monorepo grows.
However, folder-based selective testing is not enough: it is possible that changing a module would not break its own tests, but it could cause problems with downstream modules that depend on it:
- Changing `backend_service_1` may require corresponding changes to `frontend_web_1`, and if those changes are not coordinated the integration tests in `deployment_1` would fail
- Changing `frontend_web_utils` could potentially break both `frontend_web_1` and `frontend_web_2`, which depend on it, and thus we need to run the tests for both those modules to validate our change
In these cases, the failure mode of folder-based selective testing is:
- You change code in a folder
- That folder's tests may pass
- You merge your change into the main repository
- Only after merging, you notice other folders' tests failing, which you did not notice up front because you didn't run their tests before merging. But because you merged the breaking change, you have inconvenienced other people working in other parts of the codebase
- You now have to tediously revert your breaking change, or rush a fix-forward to un-break the folders whose tests you broke, and unblock the developers working in those folders
Folder-based selective testing works fine for codebases with 10-100 developers: there are occasional cases where a breakage might slip through, but generally it’s infrequent enough that it’s tolerable. But as the development organization grows beyond 100, these breakages affect more and more people and become more and more painful. To resolve this, we need something more sophisticated.
Dependency-based Selective Testing
To solve the problem of code changes potentially breaking downstream modules, we need to make sure that for every code change, we run both the tests for that module as well as every downstream test. For example, if we make a change to `backend_service_1`, we need to run the unit tests for `backend_service_1` as well as the integration tests for `deployment_1`:
On the other hand, if we make a change to `frontend_web_utils`, we need to run the unit tests for `frontend_web_1` and `frontend_web_2`, as well as the integration tests for `deployment_1` and `deployment_2`, but not `deployment_3`, since (in this example) it doesn't depend on any frontend codebase:
This kind of dependency-based selective test execution is generally straightforward:
- You need to know which modules own which source files (e.g. based on the folder)
- You need to know which modules depend on which other modules
- Run a `git diff` on the code change you want to validate
- For any modules which contain changed files, run a breadth-first traversal of the module graph
- For all the modules discovered during the traversal, run their tests
The algorithm (a breadth-first search, sketched in code after the list below) is pretty trivial; the interesting part is generally how you know "which modules own which source files" and "which modules depend on which other modules".
- For smaller projects this can be managed manually in a bash or python script, e.g. this code in Apache Spark that manually maintains a list of source folders and dependencies per module, as well as what command in the underlying build tool you need to run in order to test that module (`sbt_test_goals`):
```python
tags = Module(
    name="tags",
    dependencies=[],
    source_file_regexes=["common/tags/"],
)
utils = Module(
    name="utils",
    dependencies=[tags],
    source_file_regexes=["common/utils/"],
    sbt_test_goals=["common-utils/test"],
)
kvstore = Module(
    name="kvstore",
    dependencies=[tags],
    source_file_regexes=["common/kvstore/"],
    sbt_test_goals=["kvstore/test"],
)
...
```
- In a larger project, maintaining this information by hand is tedious and error-prone, so it is better to get the information from your build tool, which already has it (e.g. via Mill Selective Execution).
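For concreteness, here is a minimal sketch of the traversal itself, assuming the two pieces of information above (module ownership and module dependencies) are already available. The module graph shown is a small hypothetical subset of the example codebase:

```python
# Hypothetical sketch of dependency-based selection: map changed files to modules,
# then walk the module graph downstream (reverse dependencies) breadth-first.
from collections import deque

# module -> folder owning its source files
SOURCE_FOLDERS = {
    "backend_service_utils": "backend_service_utils/",
    "backend_service_1": "backend_service_1/",
    "deployment_1": "deployment_1/",
}

# module -> modules it depends on
DEPENDENCIES = {
    "backend_service_utils": [],
    "backend_service_1": ["backend_service_utils"],
    "deployment_1": ["backend_service_1"],
}

def affected_modules(changed_files: list[str]) -> set[str]:
    # Invert the dependency map: module -> modules that depend on it
    dependents: dict[str, list[str]] = {m: [] for m in DEPENDENCIES}
    for module, deps in DEPENDENCIES.items():
        for dep in deps:
            dependents[dep].append(module)

    # Seed the search with modules whose folders contain changed files
    queue = deque(
        m for m, folder in SOURCE_FOLDERS.items()
        if any(f.startswith(folder) for f in changed_files)
    )
    affected = set(queue)

    # Breadth-first traversal over downstream modules
    while queue:
        module = queue.popleft()
        for downstream in dependents[module]:
            if downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected

print(affected_modules(["backend_service_utils/src/db.py"]))
# selects backend_service_utils, backend_service_1, and deployment_1
```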
An alternate mechanism for achieving dependency-based selective testing is via caching of test results, e.g. in tools like Bazel which support Remote Caching. In this approach, rather than using `git diff` and a graph traversal to decide what tests to run, we simply run every test and rely on the fact that tests run without any changes to their upstream dependencies will re-use a result from the cache automatically. Although the implementation is different, this caching-based approach largely has the same behavioral and performance characteristics as the `git diff`-based approach to dependency-based selective testing.
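To illustrate the idea (this is a toy sketch, not Bazel's actual implementation), a cache key for a test can be derived from a hash of its transitive inputs, so a re-run with unchanged upstream inputs becomes a cache hit:

```python
# Toy sketch of result caching keyed on a hash of a test's transitive inputs.
import hashlib

def inputs_hash(file_contents: dict[str, bytes]) -> str:
    """Stable hash over a test's transitive input files (paths and contents)."""
    digest = hashlib.sha256()
    for path in sorted(file_contents):
        digest.update(path.encode())
        digest.update(file_contents[path])
    return digest.hexdigest()

cache: dict[str, str] = {}  # inputs hash -> stored test result

def run_test_cached(test_name: str, file_contents: dict[str, bytes]) -> str:
    key = inputs_hash(file_contents)
    if key in cache:
        return cache[key]        # upstream inputs unchanged: cache hit, skip the test
    result = f"ran {test_name}"  # stand-in for actually executing the test
    cache[key] = result
    return result
```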
Limitations of Dependency-based Selective Testing
Dependency-based selective test execution can get you pretty far: 100s to 1,000 developers working on a shared codebase. But it still has weaknesses, and as the number of developers grows beyond 1,000, you begin noticing issues and inefficiency:
- You are limited by the granularity of your module graph. For example, `backend_service_utils` may be used by all three `backend_service`s, but not all of `backend_service_utils` is used by all three services. Thus a change to `backend_service_utils` may result in running tests for all three `backend_service`s, even if that change may not affect that particular service
- You may over-test things redundantly. For example, a function in `backend_service_utils` may be exhaustively tested in `backend_service_utils`'s own test suite. If so, running unit tests for all three `backend_service`s as well as integration tests for all three `deployment`s may be unnecessary, as they will just exercise code paths that are already exercised as part of the `backend_service_utils` test suite
These failure modes are especially problematic for integration or end-to-end tests. The nature of end-to-end tests is that they depend on everything, and so any change in your codebase triggers every `end_to_end_test` to be run. These are also the slowest tests in your codebase, so running every end-to-end test every time you touch any line of code is extremely expensive and wasteful.
For example, touching `backend_service_1` is enough to trigger all the `end_to_end_test`s:
Touching `frontend_web_2` is also enough to trigger all the `end_to_end_test`s:
The two examples above demonstrate both failure modes:
- `staging_environment` is very coarse-grained, causing all `end_to_end_test`s to be run even if they don't actually test the code in question
- Every `end_to_end_test` likely exercises the same setup/teardown/plumbing code, in addition to the core logic under test, resulting in the same code being exercised redundantly by many different tests
This results in the selective testing system wasting both time and compute resources, running tests that aren’t relevant or repeatedly testing the same code paths over and over. While there are some ways you can improve the granularity of the module graph to mitigate these two issues, these issues are fundamental to dependency-based selective testing:
- Dependency-based selective testing means a problematic code change causes all affected tests to break
- For an effective CI system, all we need is that every problematic code change breaks at least one test
Fundamentally, CI just needs to ensure that every problematic code change breaks at least one thing, because that is usually enough for the pull-request author to act on it and resolve the problem. Running more tests to display more breakages is usually a waste of time and resources. Thus, although dependency-based selective testing helps, it still falls short of the ideal of how a CI system should behave.
Heuristic-based Selective Testing
The next stage of selective testing that most teams encounter is using heuristics: these are ad-hoc rules that you put in place to decide what tests to run based on a code change. Common heuristics include:
Limiting Dependency Depth
The chances are that a breakage in module X will be caught by X's test suite, or the test suites of X's direct downstream modules, so we don't need to run every single transitive downstream module's test suite; instead, we only traverse the dependency graph to some depth N. For example, if we set `N = 1`, then a change to `backend_service_1` shown below will only run tests for `backend_service_1` and `deployment_1`, but not the `end_to_end_test`s downstream of the `staging_environment`:
At my last job, we picked `N = 8` somewhat arbitrarily; as a heuristic there is no "right" answer, and the exact value can be tuned to trade off between thoroughness and test latency. The principle here is that "most" code is tested by its own tests and those of its immediate downstream modules, so running tests for downstream folders which are too far removed in the dependency graph is "generally" not useful.
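As a sketch, the depth limit slots straight into the breadth-first traversal shown earlier by tracking how far downstream each module is from the changed one. The `dependents` map below is hypothetical, mirroring the example codebase:

```python
# Hypothetical sketch: breadth-first traversal that stops N levels downstream.
from collections import deque

def affected_modules_limited(changed_modules: set[str],
                             dependents: dict[str, list[str]],
                             max_depth: int) -> set[str]:
    queue = deque((m, 0) for m in changed_modules)
    affected = set(changed_modules)
    while queue:
        module, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not traverse further downstream than N levels
        for downstream in dependents.get(module, []):
            if downstream not in affected:
                affected.add(downstream)
                queue.append((downstream, depth + 1))
    return affected

dependents = {
    "backend_service_1": ["deployment_1"],
    "deployment_1": ["staging_environment"],
    "staging_environment": ["end_to_end_test_1"],
}
print(affected_modules_limited({"backend_service_1"}, dependents, max_depth=1))
# selects backend_service_1 and deployment_1 only
```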
Hard-coding Dependency Relationships
For example, if I change `deployment_1`, I can choose to ignore the `staging_environment` when finding downstream tests, and only run `end_to_end_test_1`, since it is the end-to-end test for the `deployment_1` module. We represent this by rendering `staging_environment` in dashed lines below, and adding additional arrows representing the hard-coded dependency from `deployment_1` to `end_to_end_test_1`:
This approach does work, as developers usually have some idea of what tests should be run to exercise their application code. But maintaining these hard-coded dependency relationships is difficult in a large and evolving codebase:
- Generally, only the most senior engineers with the most experience working in the codebase are able to make such judgements
- The nature of heuristics is that there is no right or wrong answer, and it is difficult to determine whether a selection of hard-coded dependencies is good or not
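In practice these hard-coded relationships often end up as a small, hand-maintained config that is consulted before the normal graph traversal. The sketch below is entirely hypothetical, using the example codebase's module names:

```python
# Hypothetical hand-maintained overrides: if a changed module has an entry here,
# run only the listed tests instead of traversing through coarse-grained modules
# like staging_environment.
HARDCODED_DOWNSTREAM_TESTS = {
    "deployment_1": ["end_to_end_test_1"],
    "deployment_2": ["end_to_end_test_2"],
    "deployment_3": ["end_to_end_test_3"],
}

def tests_for(module: str, traverse_downstream) -> list[str]:
    # Fall back to the normal dependency-based traversal when no override exists
    override = HARDCODED_DOWNSTREAM_TESTS.get(module)
    return override if override is not None else traverse_downstream(module)
```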
Fundamentally, tweaking magic configuration values to try and optimize an unclear result is not a good use of developer time, but in some cases doing so may be necessary in large codebases to keep pull-request validation time and cost under control.
Machine-Learning-based Selective Testing
The last option I’ve seen in the wild is machine-learning based test selection. This has two steps:
1. Train a machine learning model on the historical commits of the codebase (e.g. the last 10,000), using each commit's `git diff` and the tests that commit broke
2. For any new pull-request, feed the `git diff` into the model trained in (1) above to get a list of likely-affected tests, and only run those
This approach basically automates the manual tweaking described in Hard-coding Dependency Relationships, and instead of some senior engineer trying to use their intuition to guess what tests to run, the ML model does the same thing. You can then easily tune the model to optimize for different tradeoffs between latency and thoroughness, depending on what is important for your development team at any point in time.
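As a rough illustration (not any particular company's system), a minimal version of this can be built with off-the-shelf tooling: the features are the changed file paths, the labels are the tests each historical commit broke, and prediction ranks tests by how likely the new diff is to break them. Everything here, including the training data, is hypothetical:

```python
# Hypothetical sketch of ML-based test selection using scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Mined from past commits: (changed file paths, tests that the commit broke)
history = [
    (["backend_service_1/api.py"], ["backend_service_1.tests", "end_to_end_test_1"]),
    (["frontend_web_utils/dom.js"], ["frontend_web_1.tests", "frontend_web_2.tests"]),
    (["backend_service_2/db.py"], ["backend_service_2.tests"]),
]

# One feature per file path; each "document" is already a list of paths
vectorizer = CountVectorizer(analyzer=lambda paths: paths)
X = vectorizer.fit_transform([files for files, _ in history])

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform([tests for _, tests in history])

# One binary classifier per test: "does this diff break this test?"
model = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# For a new pull-request, rank tests by predicted breakage probability
new_diff = ["backend_service_1/api.py", "backend_service_1/handlers.py"]
probabilities = model.predict_proba(vectorizer.transform([new_diff]))[0]
ranked = sorted(zip(binarizer.classes_, probabilities), key=lambda t: -t[1])
print(ranked)  # run the top-ranked tests, skip the rest
```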
One downside of machine-learning-based selective testing is that the ML models are a black box, and you have very little idea of why they do what they do. However, all of the heuristics people use in heuristic-based selective testing are effectively black boxes anyway, since you'll be hard-pressed to come up with an answer for why someone limiting dependency depth decided to set `N = 8` rather than `7` or `9`, or why someone hard-coding dependency relationships decided on the exact hard-coded config they ended up choosing.
Limitations of Heuristics
The nature of heuristics is that they are approximate. That means it is possible both that we run too many tests and waste time, and that we run too few tests and allow a breakage to slip through. Typically this sort of heuristic is only used early in the testing pipeline, e.g.
- When validating pull-requests, use heuristics to trim down the set of tests to run before merging
- After merging a pull-request, run a more thorough set of tests without heuristics to catch any bugs that slipped through and prevent bugs from being shipped to customers
- If a bug is noticed during post-merge testing, bisect it and revert/fix the offending commit
This may seem hacky and complicated, and bisecting/reverting commits post-merge can indeed waste a lot of time. But such a workflow is necessary in any large codebase and organization. The heuristics also do not need to be 100% precise: as long as they are precise enough that the time saved skipping tests outweighs the time spent dealing with post-merge breakages, they still end up being worth it.
Selective Testing in Mill
The Mill build tool's Selective Test Execution supports Dependency-based Selective Testing out of the box. This makes it easy to set up CI for your Mill projects so that it only runs tests that are downstream of the code you changed in a pull-request. Selective Test Execution in Mill is implemented at the task level, so even custom tasks and overrides can benefit from it. Mill's own pull-request validation jobs benefit greatly from selective testing, and you can see documentation-only pull-requests such as #4175 basically skipping the entire test suite, since it did not touch any files that could affect those tests.
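In a CI pipeline this typically looks something like the sketch below: record the state of the task graph on the target branch, then after applying the pull-request ask Mill to run only the affected test tasks. The `__.test` selector and the exact workflow are simplified here; see the Mill Selective Execution documentation for details:

```
# On the target branch, record metadata about the relevant tasks and their inputs
./mill selective.prepare __.test

# Apply / check out the pull-request's changes, then run only the affected tests
./mill selective.run __.test

# Or just list which test tasks would be selected, without running them
./mill selective.resolve __.test
```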
However, although Mill provides better support for selective testing than most build tools (which provide none), the Limitations of Dependency-based Selective Testing discussed above do cause issues. Even in Mill's own pull-request validation jobs, the fact that the most expensive integration and end-to-end tests are selected every time causes slowness. What kind of Heuristic-based Selective Testing could help improve things remains an open question.
If you are interested in build tools, especially where they apply to selective testing on large codebases and monorepos, you should definitely give the Mill Build Tool a try! Mill’s support for selective testing can definitely help you keep pull-request validation times reasonable in a large codebase or monorepo, which is key to keeping developers productive and development velocity fast even as the size and scope of a project grows.