Faster CI with Selective Testing
Li Haoyi, 24 December 2024
Selective testing is a key technique necessary for working with any large codebase or monorepo: picking which tests to run to validate a change or pull-request, because running every test every time is costly and slow. This blog post explores what selective testing is all about and the different approaches you can take, based on my experience working on developer tooling and CI for the last decade at Dropbox and Databricks. Lastly, we will discuss how selective testing is supported by the Mill build tool.
Large Codebases
Although a codebase can be large, you are typically only working on a fraction of it at any point in time. As an example that we will use throughout this article, consider a large codebase or monorepo that contains the code for:
- 3 backend services: `backend_service_1`, `backend_service_2`, `backend_service_3`
- 2 frontend web codebases: `frontend_web_1`, `frontend_web_2`
- 1 `backend_service_utils` module, and 1 `frontend_web_utils` module
- 3 deployments: `deployment_1`, `deployment_2`, `deployment_3`, which make use of the backend services and frontend web codebases
- The three deployments may be combined into a `staging_environment`, which is then used in three sets of `end_to_end_test`s, one for each deployment
These modules may depend on each other as shown in the diagram below, with the various `backend_service` and `frontend_web` codebases grouped into deployments, and the `_utils` modules shared between them:
These various modules would typically each have their own test suite, which for simplicity I left out of the diagrams.
Although the codebase described above is just an example, it reflects the kind of codebase structure that exists in many real-world codebases. Using this example, let us now consider a few ways in which selective testing can be used.
No Selective Testing
Most software projects start off naively running every test on every pull-request to validate the changes. For small projects this is fine. Every project starts small and most projects stay small, so this approach is not a problem at all.
However, for the projects that do continue growing, this strategy quickly becomes infeasible:
- The size of the codebase (in number of modules, lines of code, or number of tests) grows linearly, O(n), over time
- The number of tests necessary to validate any pull-request also grows linearly, O(n)
This means that although the test runs may start off running quickly, they naturally slow down over time:
- A week-old project may have a test suite that runs in seconds
- By a few months, the test suite may start taking minutes to run
- After a few years, the tests may be taking over an hour
- And there is no upper bound: in a large growing project, test runs can easily take several hours or even days to run
Although at a small scale waiting seconds or minutes for tests to run is not a problem, the fact that it grows unbounded means it will eventually become a problem on any growing codebase. Large codebases with test suites taking 4-8 hours to run are not uncommon at all, and this can become a real bottleneck in how fast developers can implement features or otherwise improve the functionality of the codebase.
In general, "no selective testing" works fine for small projects with 1-10 developers, but beyond that the inefficiency of running tests not relevant to your changes starts noticeably slowing down day to day development. At that point, the first thing people reach for is some kind of folder-based selective testing.
Folder-based Selective Testing
Typically, in any large codebase, most work happens within a single part of it: e.g. one developer may be working on `backend_service_1`, another may be working on `frontend_web_2`.
The obvious thing to do would be to make sure each module is in its own folder, e.g.
```
my_repo/
    backend_service_1/
    backend_service_2/
    backend_service_3/
    frontend_web_1/
    frontend_web_2/
    backend_service_utils/
    frontend_web_utils/
    deployment_1/
    deployment_2/
    deployment_3/
    ...
```
Doing simple folder-based selective execution generally involves the following (a minimal code sketch follows the examples below):
- Run a `git diff` on the code change you want to validate
- For any folders which contain changed files, run their tests
For example,

- A PR changing `backend_service_1/` will need to run the `backend_service_1/` tests
- A PR changing `frontend_web_2/` will need to run the `frontend_web_2/` tests
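Below is a minimal sketch of what this could look like as a standalone Python script. The folder list and the test command are assumptions made for this example codebase, not part of any real tool; in a real repo you would plug in your own folders and test runner.

```python
# Hypothetical sketch of folder-based test selection for the example repo above.
import subprocess

MODULE_FOLDERS = [
    "backend_service_1", "backend_service_2", "backend_service_3",
    "frontend_web_1", "frontend_web_2",
    "backend_service_utils", "frontend_web_utils",
    "deployment_1", "deployment_2", "deployment_3",
]

def changed_files(base: str = "origin/main") -> list[str]:
    """Files touched by the change under validation, relative to the target branch."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

def folders_to_test(changed: list[str]) -> set[str]:
    """Any folder containing a changed file gets its tests run."""
    return {
        folder for folder in MODULE_FOLDERS
        if any(path.startswith(folder + "/") for path in changed)
    }

if __name__ == "__main__":
    for folder in sorted(folders_to_test(changed_files())):
        print("running tests for", folder)
        # e.g. subprocess.run(["./run_tests.sh", folder], check=True)
```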
This does limit the number of tests any individual needs to execute: someone working on `backend_service_1` does not need to run the tests for `backend_service_2`, `backend_service_3`, etc. This helps keep test times from growing unbounded as the monorepo grows.
However, folder-based selective testing is not enough: it is possible that changing a module would not break its own tests, but it could cause problems with downstream modules that depend on it:
- Changing `backend_service_1` may require corresponding changes to `frontend_web_1`, and if those changes are not coordinated the integration tests in `deployment_1` would fail
- Changing `frontend_web_utils` could potentially break both `frontend_web_1` and `frontend_web_2`, which depend on it, and thus we need to run the tests for both those modules to validate our change
In these cases, the failure mode of folder-based selective testing is:
- You change code in a folder
- That folder's tests may pass
- You merge your change into the main repository
- Only after merging, you notice other folders' tests failing, which you did not notice up front because you didn't run their tests before merging. But because you merged the breaking change, you have inconvenienced other people working in other parts of the codebase
- You now have to tediously revert your breaking change, or rush a fix-forward to un-break the folders whose tests you broke, and unblock the developers working in those folders
Folder-based selective testing works fine for codebases with 10-100 developers: there are occasional cases where a breakage might slip through, but generally it’s infrequent enough that it’s tolerable. But as the development organization grows beyond 100, these breakages affect more and more people and become more and more painful. To resolve this, we need something more sophisticated.
Dependency-based Selective Testing
To solve the problem of code changes potentially breaking downstream modules, we need to make sure that for every code change, we run both the tests for that module as well as every downstream test. For example, if we make a change to `backend_service_1`, we need to run the unit tests for `backend_service_1` as well as the integration tests for `deployment_1`:
On the other hand, if we make a change to `frontend_web_utils`, we need to run the unit tests for `frontend_web_1` and `frontend_web_2`, as well as the integration tests for `deployment_1` and `deployment_2`, but not `deployment_3`, since (in this example) it doesn't depend on any frontend codebase:
This kind of dependency-based selective test execution is generally straightforward:
- You need to know which modules own which source files (e.g. based on the folder)
- You need to know which modules depend on which other modules
- Run a `git diff` on the code change you want to validate
- For any modules which contain changed files, run a breadth-first traversal of the module graph
- For all the modules discovered during the traversal, run their tests
The algorithm (a breadth-first search, sketched in code after the list below) is pretty trivial; the interesting part is generally how you know "which modules own which source files" and "which modules depend on which other modules".
- For smaller projects this can be managed manually in a bash or python script, e.g. this code in Apache Spark that manually maintains a list of source folders and dependencies per module, as well as what command in the underlying build tool you need to run in order to test that module (`sbt_test_goals`):
```python
tags = Module(
    name="tags",
    dependencies=[],
    source_file_regexes=["common/tags/"],
)
utils = Module(
    name="utils",
    dependencies=[tags],
    source_file_regexes=["common/utils/"],
    sbt_test_goals=["common-utils/test"],
)
kvstore = Module(
    name="kvstore",
    dependencies=[tags],
    source_file_regexes=["common/kvstore/"],
    sbt_test_goals=["kvstore/test"],
)
...
```
- In a larger project, maintaining this information by hand is tedious and error-prone, so it is better to get the information from your build tool, which already has it (e.g. via Mill Selective Execution).
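For concreteness, here is a minimal sketch of the traversal itself, assuming the two pieces of information above (module ownership and module dependencies) are already available. The module graph shown is a small hypothetical subset of the example codebase:

```python
# Hypothetical sketch of dependency-based selection: map changed files to modules,
# then walk the module graph downstream (reverse dependencies) breadth-first.
from collections import deque

# module -> folder owning its source files
SOURCE_FOLDERS = {
    "backend_service_utils": "backend_service_utils/",
    "backend_service_1": "backend_service_1/",
    "deployment_1": "deployment_1/",
}

# module -> modules it depends on
DEPENDENCIES = {
    "backend_service_utils": [],
    "backend_service_1": ["backend_service_utils"],
    "deployment_1": ["backend_service_1"],
}

def affected_modules(changed_files: list[str]) -> set[str]:
    # Invert the dependency map: module -> modules that depend on it
    dependents: dict[str, list[str]] = {m: [] for m in DEPENDENCIES}
    for module, deps in DEPENDENCIES.items():
        for dep in deps:
            dependents[dep].append(module)

    # Seed the search with modules whose folders contain changed files
    queue = deque(
        m for m, folder in SOURCE_FOLDERS.items()
        if any(f.startswith(folder) for f in changed_files)
    )
    affected = set(queue)

    # Breadth-first traversal over downstream modules
    while queue:
        module = queue.popleft()
        for downstream in dependents[module]:
            if downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected

print(affected_modules(["backend_service_utils/src/db.py"]))
# selects backend_service_utils, backend_service_1, and deployment_1
```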
An alternate mechanism for achieving dependency-based selective testing is via caching of test results, e.g. in tools like Bazel which support Remote Caching. In this approach, rather than using `git diff` and a graph traversal to decide what tests to run, we simply run every test and rely on the fact that tests run without any changes to their upstream dependencies will re-use a result from the cache automatically. Although the implementation is different, this caching-based approach largely has the same behavioral and performance characteristics as the `git diff`-based approach to dependency-based selective testing.
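To illustrate the idea (this is a toy sketch, not Bazel's actual implementation), a cache key for a test can be derived from a hash of its transitive inputs, so a re-run with unchanged upstream inputs becomes a cache hit:

```python
# Toy sketch of result caching keyed on a hash of a test's transitive inputs.
import hashlib

def inputs_hash(file_contents: dict[str, bytes]) -> str:
    """Stable hash over a test's transitive input files (paths and contents)."""
    digest = hashlib.sha256()
    for path in sorted(file_contents):
        digest.update(path.encode())
        digest.update(file_contents[path])
    return digest.hexdigest()

cache: dict[str, str] = {}  # inputs hash -> stored test result

def run_test_cached(test_name: str, file_contents: dict[str, bytes]) -> str:
    key = inputs_hash(file_contents)
    if key in cache:
        return cache[key]        # upstream inputs unchanged: cache hit, skip the test
    result = f"ran {test_name}"  # stand-in for actually executing the test
    cache[key] = result
    return result
```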
Limitations of Dependency-based Selective Testing
Dependency-based selective test execution can get you pretty far: 100s to 1,000 developers working on a shared codebase. But it still has weaknesses, and as the number of developers grows beyond 1,000, you begin noticing issues and inefficiency:
- You are limited by the granularity of your module graph. For example, `backend_service_utils` may be used by all three `backend_service`s, but not all of `backend_service_utils` is used by all three services. Thus a change to `backend_service_utils` may result in running tests for all three `backend_service`s, even if that change may not affect that particular service
- You may over-test things redundantly. For example, a function in `backend_service_utils` may be exhaustively tested in `backend_service_utils`'s own test suite. If so, running unit tests for all three `backend_service`s as well as integration tests for all three `deployment`s may be unnecessary, as they will just exercise code paths that are already exercised as part of the `backend_service_utils` test suite
These failure modes are especially problematic for integration or end-to-end tests. The nature of end-to-end tests is that they depend on everything, and so any change in your codebase triggers every `end_to_end_test` to be run. These are also the slowest tests in your codebase, so running every end-to-end test every time you touch any line of code is extremely expensive and wasteful.
For example, touching `backend_service_1` is enough to trigger all the `end_to_end_test`s:
Touching `frontend_web_2` is also enough to trigger all the `end_to_end_test`s:
The two examples above demonstrate both failure modes:
- `staging_environment` is very coarse-grained, causing all `end_to_end_test`s to be run even if they don't actually test the code in question
- Every `end_to_end_test` likely exercises the same setup/teardown/plumbing code, in addition to the core logic under test, resulting in the same code being exercised redundantly by many different tests
This results in the selective testing system wasting both time and compute resources, running tests that aren’t relevant or repeatedly testing the same code paths over and over. While there are some ways you can improve the granularity of the module graph to mitigate these two issues, these issues are fundamental to dependency-based selective testing:
- Dependency-based selective testing means a problematic code change causes all affected tests to break
- For an effective CI system, all we need is that every problematic code change breaks at least one test
Fundamentally, CI just needs to ensure that every problematic code change breaks at least one thing, because that is usually enough for the pull-request author to act on it and resolve the problem. Running more tests to display more breakages is usually a waste of time and resources. Thus, although dependency-based selective testing helps, it still falls short of the ideal of how a CI system should behave.
Heuristic-based Selective Testing
The next stage of selective testing that most teams encounter is using heuristics: these are ad-hoc rules that you put in place to decide what tests to run based on a code change. Common heuristics include:
Limiting Dependency Depth
The chances are that a breakage in module X will be caught by X's test suite, or the test suites of X's direct downstream modules, so we don't need to run every single transitive downstream module's test suite; instead, we only traverse the dependency graph to some depth N. For example, if we set `N = 1`, then a change to `backend_service_1` shown below will only run tests for `backend_service_1` and `deployment_1`, but not the `end_to_end_test`s downstream of the `staging_environment`:
At my last job, we picked `N = 8` somewhat arbitrarily; as a heuristic there is no "right" answer, and the exact value can be tuned to trade off between thoroughness and test latency. The principle here is that "most" code is tested by its own tests and those of its immediate downstream modules, so running tests for downstream folders which are too far removed in the dependency graph is "generally" not useful.
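As a sketch, the depth limit slots straight into the breadth-first traversal shown earlier by tracking how far downstream each module is from the changed one. The `dependents` map below is hypothetical, mirroring the example codebase:

```python
# Hypothetical sketch: breadth-first traversal that stops N levels downstream.
from collections import deque

def affected_modules_limited(changed_modules: set[str],
                             dependents: dict[str, list[str]],
                             max_depth: int) -> set[str]:
    queue = deque((m, 0) for m in changed_modules)
    affected = set(changed_modules)
    while queue:
        module, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not traverse further downstream than N levels
        for downstream in dependents.get(module, []):
            if downstream not in affected:
                affected.add(downstream)
                queue.append((downstream, depth + 1))
    return affected

dependents = {
    "backend_service_1": ["deployment_1"],
    "deployment_1": ["staging_environment"],
    "staging_environment": ["end_to_end_test_1"],
}
print(affected_modules_limited({"backend_service_1"}, dependents, max_depth=1))
# selects backend_service_1 and deployment_1 only
```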
Hard-coding Dependency Relationships
For example, if I change `deployment_1`, I can choose to ignore the `staging_environment` when finding downstream tests, and only run `end_to_end_test_1`, since it is the end-to-end test for the `deployment_1` module. We represent this by rendering `staging_environment` in dashed lines below, and adding additional arrows representing the hard-coded dependency from `deployment_1` to `end_to_end_test_1`:
This approach does work, as developers usually have some idea of what tests should be run to exercise their application code. But maintaining these hard-coded dependency relationships is difficult in a large and evolving codebase:
- Generally, only the most senior engineers with the most experience working in the codebase are able to make such judgements
- The nature of heuristics is that there is no right or wrong answer, and it is difficult to determine whether a selection of hard-coded dependencies is good or not
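In practice these hard-coded relationships often end up as a small, hand-maintained config that is consulted before the normal graph traversal. The sketch below is entirely hypothetical, using the example codebase's module names:

```python
# Hypothetical hand-maintained overrides: if a changed module has an entry here,
# run only the listed tests instead of traversing through coarse-grained modules
# like staging_environment.
HARDCODED_DOWNSTREAM_TESTS = {
    "deployment_1": ["end_to_end_test_1"],
    "deployment_2": ["end_to_end_test_2"],
    "deployment_3": ["end_to_end_test_3"],
}

def tests_for(module: str, traverse_downstream) -> list[str]:
    # Fall back to the normal dependency-based traversal when no override exists
    override = HARDCODED_DOWNSTREAM_TESTS.get(module)
    return override if override is not None else traverse_downstream(module)
```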
Fundamentally, tweaking magic configuration values to try and optimize an unclear result is not a good use of developer time, but in some cases doing so may be necessary in large codebases to keep pull-request validation time and cost under control.
Machine-Learning-based Selective Testing
The last option I’ve seen in the wild is machine-learning based test selection. This has two steps:
1. Train a machine learning model on the historical commits of the codebase (e.g. the last 10,000), using each commit's `git diff` and the tests that commit broke
2. For any new pull-request, feed the `git diff` into the model trained in (1) above to get a list of likely-affected tests, and only run those
This approach basically automates the manual tweaking described in Hard-coding Dependency Relationships, and instead of some senior engineer trying to use their intuition to guess what tests to run, the ML model does the same thing. You can then easily tune the model to optimize for different tradeoffs between latency and thoroughness, depending on what is important for your development team at any point in time.
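As a rough illustration (not any particular company's system), a minimal version of this can be built with off-the-shelf tooling: the features are the changed file paths, the labels are the tests each historical commit broke, and prediction ranks tests by how likely the new diff is to break them. Everything here, including the training data, is hypothetical:

```python
# Hypothetical sketch of ML-based test selection using scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Mined from past commits: (changed file paths, tests that the commit broke)
history = [
    (["backend_service_1/api.py"], ["backend_service_1.tests", "end_to_end_test_1"]),
    (["frontend_web_utils/dom.js"], ["frontend_web_1.tests", "frontend_web_2.tests"]),
    (["backend_service_2/db.py"], ["backend_service_2.tests"]),
]

# One feature per file path; each "document" is already a list of paths
vectorizer = CountVectorizer(analyzer=lambda paths: paths)
X = vectorizer.fit_transform([files for files, _ in history])

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform([tests for _, tests in history])

# One binary classifier per test: "does this diff break this test?"
model = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# For a new pull-request, rank tests by predicted breakage probability
new_diff = ["backend_service_1/api.py", "backend_service_1/handlers.py"]
probabilities = model.predict_proba(vectorizer.transform([new_diff]))[0]
ranked = sorted(zip(binarizer.classes_, probabilities), key=lambda t: -t[1])
print(ranked)  # run the top-ranked tests, skip the rest
```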
One downside of machine-learning-based selective testing is that the ML models are a black box, and you have very little idea of why they do what they do. However, all of the heuristics people use in heuristic-based selective testing are effectively black boxes anyway, since you'll be hard-pressed to come up with an answer for why someone limiting dependency depth decided to set `N = 8` rather than `7` or `9`, or why someone hard-coding dependency relationships decided on the exact hard-coded config they ended up choosing.
Limitations of Heuristics
The nature of heuristics is that they are approximate. That means it is possible both that we run too many tests and waste time, and that we run too few tests and allow a breakage to slip through. Typically this sort of heuristic is only used early in the testing pipeline, e.g.
- When validating pull-requests, use heuristics to trim down the set of tests to run before merging
- After merging a pull-request, run a more thorough set of tests without heuristics to catch any bugs that slipped through and prevent bugs from being shipped to customers
- If a bug is noticed during post-merge testing, bisect it and revert/fix the offending commit
This may seem hacky and complicated, and bisecting/reverting commits post-merge can indeed waste a lot of time. But such a workflow is necessary in any large codebase and organization. The heuristics also do not need to be 100% precise: as long as they are precise enough that the time saved skipping tests outweighs the time spent dealing with post-merge breakages, they still end up being worth it.
Selective Testing in Mill
The Mill build tool's Selective Test Execution supports Dependency-based Selective Testing out of the box. This makes it easy to set up CI for your Mill projects so that it only runs tests that are downstream of the code you changed in a pull-request. Selective Test Execution in Mill is implemented at the task level, so even custom tasks and overrides can benefit from it. Mill's own pull-request validation jobs benefit greatly from selective testing, and you can see documentation-only pull-requests such as #4175 basically skipping the entire test suite, since it did not touch any files that could affect those tests.
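In a CI pipeline this typically looks something like the sketch below: record the state of the task graph on the target branch, then after applying the pull-request ask Mill to run only the affected test tasks. The `__.test` selector and the exact workflow are simplified here; see the Mill Selective Execution documentation for details:

```
# On the target branch, record metadata about the relevant tasks and their inputs
./mill selective.prepare __.test

# Apply / check out the pull-request's changes, then run only the affected tests
./mill selective.run __.test

# Or just list which test tasks would be selected, without running them
./mill selective.resolve __.test
```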
However, although Mill provides better support for selective testing than most build tools (which provide none), the Limitations of Dependency-based Selective Testing discussed above do cause issues. Even in Mill's own pull-request validation jobs, the fact that the most expensive integration and end-to-end tests are selected every time causes slowness. What kind of Heuristic-based Selective Testing could help improve things remains an open question.
If you are interested in build tools, especially where they apply to selective testing on large codebases and monorepos, you should definitely give the Mill Build Tool a try! Mill’s support for selective testing can definitely help you keep pull-request validation times reasonable in a large codebase or monorepo, which is key to keeping developers productive and development velocity fast even as the size and scope of a project grows.