Strategies for Efficiently Parallelizing JVM Test Suites
Li Haoyi, 17 March 2025
Test suites are in theory the ideal workload to parallelize, as they usually contain a large number of independent tests that can each be run in parallel. But implementing parallelism in practice can be challenging: a naive implementation can easily result in increased resource usage without any speedup, or even slow things down compared to running things on a single thread.
This blog post will explore the design and evolution of the Mill JVM build tool’s test parallelism strategy, from its start as a simple serial test runner, to naive module-based and class-based sharding, to the dynamic sharding strategy implemented in the latest version of Mill 0.12.9. We will discuss the pros and cons of the different approaches to test parallelization, analyze how they perform both theoretically and with benchmarks, and compare them to the runtime characteristics of other build tool test runners.
Serial Execution
The Mill build tool started off without any parallelism by default, and that extended to tests as well. When asked to execute tasks, Mill would:
- Receive the build tasks specified at the command line
- Do a breadth-first search on the task graph to find the full list of transitive tasks
- Sort the tasks in topological order using Tarjan's Algorithm
- Execute the tasks in order one at a time, skipping those with earlier cached values it can re-use
- If any tasks contained test suites, these would be run one at a time in a single JVM subprocess
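As a rough illustration, this serial scheduling loop boils down to something like the Scala sketch below. The Task type and helper names here are hypothetical, not Mill's actual internals, and a plain depth-first post-order traversal stands in for Tarjan's algorithm:

case class Task(name: String, deps: Seq[Task], compute: () => String)

def executeSerially(goals: Seq[Task]): Map[String, String] = {
  // Post-order depth-first traversal yields a valid topological order;
  // Mill itself uses Tarjan's algorithm, which additionally detects cycles
  val ordered = collection.mutable.LinkedHashSet.empty[Task]
  def visit(task: Task): Unit = { task.deps.foreach(visit); ordered += task }
  goals.foreach(visit)

  val results = collection.mutable.Map.empty[String, String]
  for (task <- ordered) {
    // Run each task one at a time on the current thread; a real build
    // tool would skip tasks whose cached values are still valid
    results(task.name) = task.compute()
  }
  results.toMap
}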
While this serial execution works fine, it's unsatisfying in a world where modern computers each have anywhere from 8-16 CPU cores you can make use of. You may be waiting seconds or minutes for your tests to run on one CPU core while the other cores sit idle.
To evaluate how well Serial Execution and the later parallelism strategies perform, we carried out two analyses:
- A theoretical analysis using a simplified example build, the kind you would do on a whiteboard
- A practical benchmark using two real-world codebases with very different testing workloads
These benchmarks are rough, but are enough to give a good understanding of the benefits and tradeoffs involved with the different parallelization strategies we discuss.
Theoretical Evaluation
To understand the concepts behind each strategy, we imagine using it to run the tests on a simple example build with:
- 3 test modules: ModuleA, ModuleB, and ModuleC
- These three test modules have 2, 4, and 6 test classes respectively
- Running in an environment with 3 CPUs available
We can see this visualized below for serial execution:
- The colored boxes represent test classes. This is a common unit of managing tests in the JVM ecosystem, and will serve as the base level of granularity for this article.
- The arrows represent the threads on which the test classes execute, and in the case of serial execution they all execute on a single thread.
- The dashed boxes represent the test runner processes that were spawned. In this case, every module's tests are run in a separate process, which is again a common approach to isolate the code of different modules from one another.
For this article, we assume that tests for different modules need to be run in different subprocesses. That is not universally true: Mill does have a testLocal mode that runs tests in a shared JVM to reduce overhead, at the expense of some flexibility and isolation, and other build tools have something similar. But we find that process-separation of different modules' test suites is a useful enough default that we will assume it necessary for the rest of this article.
Given this example test run, there are two numbers we would like to minimize:
- Time taken for all tests to complete. For this analysis, we treat each TestClass as taking the same amount of time, and just add up the number of test classes on the longest thread
- Processes spawned. Spawning processes is expensive, especially JVM processes that may take several seconds to launch and warm up to peak performance. Again, we just assume that every process spawn has the same overhead, and add them up
For Serial Execution, where all 12 test classes (2 + 4 + 6) run one after another on a single thread with one process per module, the numbers are:
| | Serial Execution |
|---|---|
| Time Taken | 12 |
| Processes | 3 |
Practical Evaluation
For the practical evaluation, we considered the test suites of two different codebases:
- The subset of unit tests in the Netty codebase that run in the Mill example Netty build. These contain a large number of modules (17) with a large number of test classes (233), but each test class runs relatively quickly (~0.1s). This kind of testing workload is often seen in library codebases where all logic and tests can run quickly in memory.
- The tests of Mill's own scalalib module. This is a single large module with a large number of test classes (52), but each test class runs relatively slowly (~10s). While not ideal, this kind of testing workload is common in monolithic application codebases with heavy integration testing.
In summary:
| | Modules | Test Classes | Average Duration per Test Class |
|---|---|---|---|
| Netty unit tests | 17 | 233 | ~0.1s |
| Mill scalalib tests | 1 | 52 | ~10s |
The commands to run these two benchmarks are shown below, with -j1 telling Mill to run things on a single thread:
netty$ ./mill show 'codec-{dns,haproxy,http,http2,memcache,mqtt,redis,smtp,socks,stomp,xml}.__.discoveredTestClasses' + 'transport-{blockhound-tests,native-unix-common,sctp}.__.discoveredTestClasses'
mill$ ./mill -j1 scalalib.test
The selection of test suites in the Netty codebase is somewhat arbitrary (just the tests that the example build happens to contain), but that doesn't matter, since we will be running the same selection of tests throughout this article so the comparisons between strategies stay consistent.
These two workloads are very different, and benefit from different characteristics in the parallel test runner:
- For fast unit tests, minimizing the number of processes spawned is important, since the 0.1s it takes to run the tests themselves can easily be dominated by the ~1s overhead of starting up a JVM test process
- For slower integration tests, minimizing the number of processes matters less, as adding 1s of process-spawning overhead to a 10s test class is inconvenient but not overwhelming
We will see how these numbers vary as we explore different testing strategies below, but as a baseline, the time taken for running these test suites under Serial Execution is as follows:
| | Serial Execution |
|---|---|
| Netty unit tests | 28s |
| Mill scalalib tests | 502s |
These results are from running the above commands ad-hoc on my M1 MacBook Pro with 10 cores and Java 17. The exact numbers will vary based on what test suite you choose and what hardware you run it on, but the overall trends and conclusions should be the same.
Module Sharding
Mill has always had opt-in task-level parallelism via the -j/--jobs flag (the name taken from the Make tool), and running tasks in parallel using all the cores on your system became the default in Mill 0.12.0. During testing, typically each Mill module foo would have a single foo.test sub-module, with a single foo.test.testForked task. This means that Mill's task-level parallelism effectively shards your test suites at the module level.
One consequence of this is that if your codebase was broken up into many small modules, each module’s tests could run in parallel. But if your codebase had a few large modules, you may not be able to make full use of all the CPU cores available on your machine.
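Concretely, module sharding needs no special configuration: it is just Mill's default task parallelism applied to the per-module test tasks, and can be capped with the same -j flag mentioned above (commands here assume a typical Mill 0.12.x build):

./mill __.test      # run every module's tests, one process per module, modules in parallel
./mill -j4 __.test  # same, but capped at 4 concurrent tasks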
Visualizing this on the theoretical example we saw earlier:
| | Serial Execution | Module Sharding |
|---|---|---|
| Time Taken | 12 | 6 |
| Processes | 3 | 3 |
We can see that because the three modules have different numbers of test classes within them, ModuleA.test finishes first and that thread/CPU is idle until ModuleB.test and ModuleC.test finish later. While not ideal, this is a significant improvement over Serial Execution in our theoretical example, shortening the time taken from 12 to 6, while preserving the number of processes spawned at 3.
The practical benchmarks also show significant improvements for the Netty unit tests, which run ~3x faster as they can take full advantage of the multiple cores on the machine to parallelize the test suites of the 17 modules being tested. However, the Mill scalalib tests show no significant speedup, as that benchmark is a single large module that does not benefit from module sharding.
| | Serial Execution | Module Sharding |
|---|---|---|
| Netty unit tests | 28s | 10s |
| Mill scalalib tests | 502s | 477s |
While in theory it would be ideal to break up large monoliths into multiple smaller modules each with their own test suite, doing so is tedious and manual, and realistically does not happen as often or as quickly as one might prefer. Thus a build tool needs to be able to handle these large monolithic modules and their large monolithic test suites in some reasonable manner.
Static Class Sharding
To work around the limitations of module sharding, Mill 0.12.0 introduced static class sharding, opt-in via the def testForkGrouping configuration. This allows the developer to take the Seq[String] containing all the test class names and return a nested Seq[Seq[String]] with the original list broken down into groups. Each test group would run in parallel in a separate process in a separate folder, but within each group the tests would still run sequentially.
For example, the following configuration would take the list of test classes and break it down into 1-element groups:
def testForkGrouping = discoveredTestClasses().grouped(1).toSeq
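Conversely, grouping several classes per forked process amortizes the process-spawning overhead, at the cost of less parallelism within each group. For example, batches of four test classes per process:

def testForkGrouping = discoveredTestClasses().grouped(4).toSeq

For the theoretical example below, though, we stick with the one-class-per-group configuration.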
Using static class sharding, the execution of the test suites in our theoretical example now has each test class assigned its own process (dashed boxes), with those processes making full use of the three cores available in the example:
| | Serial Execution | Module Sharding | Static Class Sharding |
|---|---|---|---|
| Time Taken | 12 | 6 | 4 |
| Processes | 3 | 3 | 12 |
Here we have shortened the time taken further, from 6 sequential test classes to just 4. However, this has come at the cost of spawning significantly more processes, as each one-class group is allocated its own process.
Our practical benchmarks reflect this change as well:
| | Serial Execution | Module Sharding | Static Class Sharding |
|---|---|---|---|
| Netty unit tests | 28s | 10s | 51s |
| Mill scalalib tests | 502s | 477s | 181s |
- The Netty unit test benchmark has lots of small, fast test classes, and so spawning a process for each test class is very expensive. We see the time taken to run all tests ballooning from 10s to 51s, as any improvement in parallelism is dominated by the cost of spawning the additional processes
- For the Mill scalalib test benchmark, which has slow test classes that take ~10 seconds each, spawning a process for each test class is a much smaller cost, and so the increased parallelism is able to provide a 2-3x speedup
The basic problem with static test sharding is that the ideal sharding depends on the runtime characteristics of your test suite:
- Small, fast test classes benefit from a coarse-grained sharding with many test classes per group. This amortizes the cost of spawning a process, and with enough test classes even a coarse-grained grouping provides plenty of opportunities for parallelism
- Large, slow test classes prefer a fine-grained sharding with only one test class per group. This maximizes parallelism, while the cost of spawning processes remains small compared to the cost of running even a single test class.
The ideal sharding for a given test suite can only be determined experimentally, and keeping the sharding optimal as the test suite evolves over time is basically impossible. And as you can see from the numbers above, static sharding can easily make things worse if misconfigured!
Thus although group-based parallelism serves as a reasonable band-aid for specific modules where you can put in the effort to tune the grouping, the amount of manual tuning and room for error means it could never be widely used or turned on by default by the build tool.
Dynamic Sharding
To try and solve the problems with static test sharding, mill#4614 by @HollandDM introduced dynamic sharding using a process pool. This is opt-in via def testParallelism = true in Mill 0.12.9, and will become the default in the next major version, Mill 0.13.0.
The idea of dynamic sharding is that you never have more than NUM_CPUS tests running in parallel anyway, so you can just spawn NUM_CPUS child processes and have those processes pull tests off a queue and run them until the queue is empty. This means the JVM startup overhead is proportional to NUM_CPUS rather than NUM_TESTS, a much smaller number, resulting in much less JVM overhead overall.
One caveat is that test classes from different modules do still need different processes for isolation. So if a process is available to run a test class, but that process was spawned for a different module than the test class's, the process needs to be shut down and a new one created for the new test class's module.
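A minimal sketch of what such a worker might look like is shown below, with hypothetical helper names and protocol, not Mill's actual code. Each of the NUM_CPUS workers pulls test classes off a shared queue, re-using its live subprocess while consecutive jobs come from the same module and replacing it when the module changes:

import java.util.concurrent.ConcurrentLinkedQueue

case class TestClassJob(module: String, className: String)

// Hypothetical test runner subprocess; the command line here is illustrative
def spawnTestRunner(module: String): Process =
  new ProcessBuilder("java", "-cp", s"out/$module/test-classes", "TestRunnerMain").start()

def runWorker(queue: ConcurrentLinkedQueue[TestClassJob]): Unit = {
  var current = Option.empty[(String, Process)] // (module, live test runner process)
  var job = queue.poll()
  while (job != null) {
    current match {
      case Some((module, _)) if module == job.module =>
        // Same module as the running process: re-use it, no spawn needed
      case other =>
        // Different module (or no process yet): tear down and spawn afresh
        other.foreach { case (_, proc) => proc.destroy() }
        current = Some((job.module, spawnTestRunner(job.module)))
    }
    // Hand the test class name to the runner over stdin (illustrative protocol)
    val proc = current.get._2
    proc.getOutputStream.write((job.className + "\n").getBytes)
    proc.getOutputStream.flush()
    job = queue.poll()
  }
  current.foreach { case (_, proc) => proc.destroy() }
}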
If you consider this approach on our theoretical example, the execution looks something like this:
| | Serial Execution | Module Sharding | Static Class Sharding | Dynamic Sharding |
|---|---|---|---|---|
| Time Taken | 12 | 6 | 4 | 4 |
| Processes | 3 | 3 | 12 | 8 |
Above, you can see that first A1, A2, and B1 are scheduled and each assigned a process (dashed boxes). When A1 and A2 finish, new processes need to be spawned to run B2 and B3, but when B1 finishes the same process can run B4. Later, C1, C2, and C3 run, and when they finish we can re-use their processes for running C4, C5, and C6 respectively.
This sharing and re-use of processes is able to bring down the number of processes spawned from 12 to 8 in our theoretical example, while preserving the time taken at 4. However, 8 is still much more than the 3 processes that serial execution or module sharding needed, indicating that this approach still adds significant process-spawning overhead compared to the more naive approaches we saw earlier.
This difference in the number of processes spawned reflects in the practical benchmarks below:
| | Serial Execution | Module Sharding | Static Class Sharding | Dynamic Sharding |
|---|---|---|---|---|
| Netty unit tests | 28s | 10s | 51s | 21s |
| Mill scalalib tests | 502s | 477s | 181s | 160s |
Here we can see that both the Netty unit test benchmark and the Mill scalalib benchmark show a significant speedup using dynamic sharding over static class sharding, which can be attributed to the reduced number of processes being spawned. However, despite the speedup, the Netty unit test benchmark is still 2x slower than the more naive module sharding approach.
From the diagram above, we can see the nature of the problem: ideally we would want A1 and A2 to share one process, B1, B2, B3, and B4 to share another process, and so on. But because we are scheduling test classes to run arbitrarily without regard to re-use, each thread ends up running tests from different modules rather often, with each such change forcing a new process to be spawned.
Biased Dynamic Sharding
The last piece of the puzzle is to use dynamic test sharding, but to bias the Mill scheduler towards running the first test process for each module as soon as possible, and subsequent processes only later if there are no other first-processes to run.
What biased dynamic sharding does is try to minimize the number of processes each module's test suite runs in: if the scheduler has a choice between spawning a second process for ModuleA or the first process for ModuleB, it should prioritize the first process for ModuleB. This gives the existing first process for ModuleA a chance to complete its current test class and pick up the next one, without needing to spawn a second process and pay the cost of doing so.
Simulating this on our theoretical example, execution ends up looking like this:
| | Serial Execution | Module Sharding | Static Class Sharding | Dynamic Sharding | Biased Dynamic Sharding |
|---|---|---|---|---|---|
| Time Taken | 12 | 6 | 4 | 4 | 4 |
| Processes | 3 | 3 | 12 | 8 | 4 |
In the diagram above, we can see that biased dynamic sharding is able to maintain the time taken at 4, while reducing the number of processes spawned (dashed boxes) from 8 to 4. We can see that ModuleA (red), ModuleB (green), and ModuleC (blue) are each assigned a single process to do all of their work, and only when a thread frees up (when A1 and A2 have completed) is ModuleC given the idle thread to parallelize its remaining test classes.
This is a strict improvement over the previous dynamic sharding and static class sharding approaches, and it is reflected in the practical benchmarks where both Netty unit tests and Mill scalalib tests show speedups over the previous dynamic sharding approach:
| | Serial Execution | Module Sharding | Static Class Sharding | Dynamic Sharding | Biased Dynamic Sharding |
|---|---|---|---|---|---|
| Netty unit tests | 28s | 10s | 51s | 21s | 12s |
| Mill scalalib tests | 502s | 477s | 181s | 160s | 132s |
Notably, the Netty unit tests benchmark is now comparable to the performance we were seeing with module sharding! Although there is still a slight slowdown in the practical benchmark - presumably from the slight increase in the number of spawned processes - it no longer shows the large 2-5x slowdowns we saw with static class sharding and dynamic sharding. Biased dynamic sharding seems to finally provide a test parallelization strategy that is flexible enough to handle widely varying workloads without the pathological slowdowns that previous strategies exhibited.
Implementation
The implementation of the various parallelism strategies we discussed above isn't complicated: the Mill build tool is a JVM application, and all these strategies basically boil down to passing Runnables to a ThreadPoolExecutor, each one using ProcessBuilder to spawn the test runner. Different strategies have different levels of granularity for the Runnables, and different queues for the ThreadPoolExecutor (e.g. biased dynamic sharding uses a PriorityBlockingQueue to bias the scheduler towards running some tasks over others), but fundamentally there's nothing advanced going on.
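For instance, the biasing could look something like the following sketch, with illustrative class and field names rather than Mill's actual implementation. Tasks that would spawn a module's first test process compare as "smaller", so the PriorityBlockingQueue hands them to worker threads first:

import java.util.concurrent.{PriorityBlockingQueue, ThreadPoolExecutor, TimeUnit}

class TestGroupTask(val module: String, val firstForModule: Boolean, body: () => Unit)
    extends Runnable with Comparable[TestGroupTask] {
  def run(): Unit = body()
  // First-process-per-module tasks sort ahead of second/third processes
  def compareTo(other: TestGroupTask): Int =
    java.lang.Boolean.compare(other.firstForModule, firstForModule)
}

val numCpus = Runtime.getRuntime.availableProcessors()
val pool = new ThreadPoolExecutor(
  numCpus, numCpus, 0L, TimeUnit.SECONDS,
  new PriorityBlockingQueue[Runnable]() // orders tasks by the compareTo above
)
// Note: tasks must go through execute(), not submit(); submit() wraps the
// Runnable in a (non-Comparable) FutureTask, which the priority queue rejects
pool.execute(new TestGroupTask("ModuleB", true, () => println("ModuleB, first process")))
pool.execute(new TestGroupTask("ModuleA", false, () => println("ModuleA, second process")))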
Perhaps the most interesting implementation detail is for dynamic sharding: this requires the build tool to spawn a pool of test runner processes that pull test classes off of a queue until all test classes have been completed. Mill implements this queue using a folder on disk containing one file per test class, which each spawned process loops over, attempting to claim each file via an Atomic Filesystem Move. This lets us avoid the complexity of managing a third-party queue system, or dealing with RPCs between different processes via sockets or memmapped files. The simple disk-based queue is more than capable of handling the relatively small scale that a build tool test runner operates at (100s to 1000s of test classes).
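A sketch of the claiming logic is below; the directory layout is an assumption for illustration, not Mill's exact file format. Each test runner process scans the queue folder and claims a class by atomically moving its marker file into the process's own folder, so exactly one process wins each race:

import java.nio.file.{Files, Path, StandardCopyOption}
import scala.jdk.CollectionConverters._

// Returns the next test class this process successfully claimed, if any
def claimNextTestClass(queueDir: Path, claimedDir: Path): Option[String] = {
  Files.list(queueDir).iterator().asScala
    .flatMap { file =>
      try {
        // ATOMIC_MOVE succeeds for exactly one process; everyone else throws
        Files.move(file, claimedDir.resolve(file.getFileName), StandardCopyOption.ATOMIC_MOVE)
        Some(file.getFileName.toString)
      } catch {
        case _: java.nio.file.FileSystemException => None // lost the race, try the next file
      }
    }
    .nextOption()
}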
Build Tool Comparisons
Mill is a relatively new JVM build tool, which raises the question: how does Mill's test runner compare to other JVM build tools like Maven, Gradle, or SBT? For this we ran the benchmarks above on the Mill example builds we used for our Maven case study, Gradle case study, and SBT case study. Although these benchmarks are rough, they should give you a good intuition for where the strategies discussed above fit into the larger build tool landscape.
Maven Comparison
The Netty project we’ve been discussing in this article is normally built using Maven: the Mill build is non-standard and used mainly as a Case Study Comparison, but that gives us an opportunity to run these benchmarks using Maven to see how it compares to the strategies discussed above. To run the same subset of unit test suites using Maven that we ran using Mill in the above examples, we used these commands, resulting in the following timings for various testing strategies:
# Maven Serial
./mvnw -pl codec-dns,codec-haproxy,codec-http,codec-http2,codec-memcache,codec-mqtt,codec-redis,codec-smtp,codec-socks,codec-stomp,codec-xml,transport-blockhound-tests,transport-native-unix-common,transport-sctp test
# Maven Parallel
./mvnw -T 10 -pl codec-dns,codec-haproxy,codec-http,codec-http2,codec-memcache,codec-mqtt,codec-redis,codec-smtp,codec-socks,codec-stomp,codec-xml,transport-blockhound-tests,transport-native-unix-common,transport-sctp test
| Mill | Serial Execution | Module Sharding | Static Class Sharding | Dynamic Sharding | Biased Dynamic Sharding |
|---|---|---|---|---|---|
| Netty unit tests | 28s | 10s | 51s | 21s | 12s |
| Maven | Serial | Parallel |
|---|---|---|
| Netty unit tests | 36s | 15s |
Here we can see that the Mill parallel testing strategies show some speedups over the Maven build using the Maven-Surefire-Plugin. For the purposes of this comparison, we did not manage to get further speedups by setting the Maven-Surefire-Plugin's internal parallelism configuration (link), and so did not include that in the table above.
Gradle Comparison
For another data point, we repeated the same benchmarks on the Mockito codebase. Mockito is a popular mocking framework for JVM unit tests, and its codebase is built using Gradle. Like Netty, we have an example Mill build for Mockito as a Case Study Comparison, which, although not 100% complete, lets us compare the Mill test parallelism strategies to those of Gradle. The commands used to run the subset of the Mockito build that works on both Mill and Gradle are shown below, along with the timings:
$ ./mill test + subprojects.android.test + subprojects.errorprone.test + subprojects.extTest.test + subprojects.inlineTest.test + subprojects.junit-jupiter.test + subprojects.junitJupiterExtensionTest.test + subprojects.junitJupiterInlineMockMakerExtensionTest.test + subprojects.junitJupiterParallelTest.test + subprojects.memory-test.test + subprojects.programmatic-test.test + subprojects.proxy.test + subprojects.subclass.test
$ ./gradlew cleanTest && ./gradlew :test android:test errorprone:test extTest:test inlineTest:test junit-jupiter:test junitJupiterExtensionTest:test junitJupiterInlineMockMakerExtensionTest:test junitJupiterParallelTest:test memory-test:test programmatic-test:test proxy:test subclass:test
| Mill | Serial Execution | Module Sharding | Static Class Sharding | Dynamic Sharding | Biased Dynamic Sharding |
|---|---|---|---|---|---|
| Mockito unit tests | 62s | 47s | 139s | 25s | 21s |
| Gradle | Serial | Parallel | Parallel + maxParallelForks |
|---|---|---|---|
| Mockito unit tests | 90s | 56s | 31s |
The Gradle Serial and Gradle Parallel benchmarks were run with org.gradle.parallel configured accordingly: Gradle Serial is similar to Mill's serial execution, while Gradle Parallel is similar to Mill's module sharding strategy. Enabling maxParallelForks in Gradle to parallelize the tests within a subproject improves performance significantly, with numbers comparable to Mill's dynamic sharding, although it is still significantly slower than Mill's biased dynamic sharding strategy. From the numbers I would assume that Gradle's parallelism approach is similar to Mill's dynamic sharding, but it doesn't have the same process-spawning optimizations that Mill has in its biased dynamic sharding strategy.
SBT Comparison
As a last comparison, we repeated these testing benchmarks on the Gatling codebase. Gatling is a popular load testing tool written in Scala and built using SBT, and we have an example Mill build for it as a Case Study Comparison. We can run the tests for both the Mill and SBT builds for Gatling using the following commands, and repeating the benchmarks gives us the timings below:
$ ./mill __.test
$ sbt test
| Mill | Serial Execution | Module Sharding | Static Class Sharding | Dynamic Sharding | Biased Dynamic Sharding |
|---|---|---|---|---|---|
| Gatling unit tests | 27s | 17s | 38s | 17s | 12s |
| SBT | Serial | Parallel |
|---|---|---|
| Gatling unit tests | 25s | 12s |
SBT Serial is run with fork := true, spawning a process to run tests, while SBT Parallel is run with fork := false, running tests in the shared JVM process.
Here we can see that for this benchmark, SBT's in-memory parallel test running has similar performance to Mill's biased dynamic sharding. Although the Mill test runner isn't faster, it does have nice properties: running the tests in subprocesses in sandbox folders provides a greater degree of isolation between the concurrent tests, which mitigates the risk of inter-test interference. Perhaps more importantly, Mill's testParallelism is relatively self-tuning and should generally "just work" once enabled, whereas SBT has a number of different flags (parallelExecution, concurrencyRestriction, fork, testForkedParallel) that need to be configured together, which can be tedious and error-prone.
Conclusion
It’s interesting how similar the problem of parallelizing tests is to the challenge of architecting any distributed system. The ideas of static sharding and dynamic sharding should be familiar to any backend or infrastructure engineer, and the same tradeoffs that apply to their use in backend systems also apply to their use in a build tool’s test runner. It’s also surprising how much detail there is when trying to "parallelize unit tests": not only throwing the work at a thread or process-pool, but also managing the lifetimes, re-use, and scheduling of heavyweight JVM test processes in order to provide good performance across a wide variety of workloads.
The Mill build tool's test parallelism strategy has gone through a lot of iteration and improvement over the years, but at this point it is in a pretty good state. Most importantly, Mill's new def testParallelism = true flag is a single switch. It is not that much faster than running tests with older strategies or other build tools, but it provides that performance across a wide range of workloads without needing manual tuning or configuration. This simplifies the user experience, letting users spend less time fiddling with the build tool and more time focusing on their actual project. While testParallelism is opt-in for testing in the latest Mill 0.12.9, we expect to make it the default (with an opt-out) in the next major version of Mill, 0.13.0.