Deterministic Simulation Testing - Antithesis
The Deterministic Computer
For those of you that are new here, welcome! I’m a partner at General Catalyst focused on AI, infra, cyber, etc., at the earliest stages. You can also find me on X here: x.com/alex__mackenzie. Thank you to the ~1.6k of you following along 😄
Oh & if you could do me a favour and engage with this X post, it goes a long way. Enjoy reading!
Ever since reading Tristan Hume’s piece on Magic Trace (written while he was at Jane Street; if anyone knows Tristan, I’d love to say hi sometime), I’ve closely --watch’d JS’ engineering blog. Their bet on OCaml, Advent of FPGA, and Byrne Hobart’s “Understanding Jane Street” piece collectively made me rather JS-pilled.
So yeah, when I noticed that Jane Street was leading Antithesis’ latest funding round, my ears pricked up. Great infra companies frequently stem from mega-scale software companies: Confluent/Kafka from LinkedIn, Chronosphere/M3 and Temporal/Cadence from Uber, ClickHouse from Yandex, CockroachDB from ex-Googlers, PlanetScale/Vitess from YouTube, etc. I suspect great infra companies are also invested in by mega-scale software companies; when the companies are actual users of the infra vs. corp dev shops, of course. This also played out with D. E. Shaw’s early bet on Flox.
Turns out, elite testing strategies (like the ones Antithesis incorporates) require some pretty interesting eng work. The kinda work that tends to go down well in this blog. Alright, let’s bash some bugs.
As always, we’ll begin with a definition which we’ll then break down bit-by-bit:
Antithesis uses property-based testing, fuzzing, fault injection, and deterministic simulation rather than traditional software testing methods. It’s an autonomous testing platform that finds bugs in your software, with perfect reproducibility to help you fix them.
Antithesis performs each test run in a simulated environment, which they provision & manage automatically. This environment contains your entire service architecture, and uses virtualization to simulate all hardware & network components in the system. The year of the sandbox, anyone? Customer production systems are never involved, so intense fault-injection can be performed with no risk of downtime.
Anyway, we’re getting a bit ahead of ourselves, let’s dig in. What’s so bad about good ole traditional software testing anyway?
The Problem with Traditional Software Testing?
Traditional software testing largely consists of “unit tests” and “integration tests” (using libraries like pytest, Jest, JUnit, etc.). These are known as “example-based testing”. Many of you will be all too familiar with these variants of testing, buuut for the uninitiated, let’s look at unit testing. Quickly.
Unit tests check whether a specific “unit” of logic does what we expect. Let’s say we’ve created a “sort_number( )” function that, when given a list of numbers (e.g. [9, 12, 2, 6]), should return [2, 6, 9, 12]. Oh, and before you all say it, yes, I know Python comes with sorted( ) out of the box. Anyway, we’d “unit test” this function like so:
```python
# this is the actual function we will test
def sort_number(numbers):
    result = numbers.copy()
    for i in range(len(result)):
        for j in range(i + 1, len(result)):
            if result[j] < result[i]:
                result[i], result[j] = result[j], result[i]
    return result

# the tests
def test_sort_number():
    assert sort_number([9, 12, 2, 6]) == [2, 6, 9, 12]
    assert sort_number([]) == []
    assert sort_number([1]) == [1]
    assert sort_number([5, 5, 5]) == [5, 5, 5]
    assert sort_number([-3, 0, 2, -1]) == [-3, -1, 0, 2]
    assert sort_number([1, 2, 3]) == [1, 2, 3]  # already sorted
    assert sort_number([3, 2, 1]) == [1, 2, 3]  # reverse sorted
```

What we’re doing in our “test_sort_number( )” function is manually (this is important) specifying that when we pass a given list (e.g. [9, 12, 2, 6]) into our “sort_number( )” function, the output should equal a certain value; in this case, [2, 6, 9, 12]. If this isn’t the case, our unit test fails and we receive an error.
Anyway, I’m not really here to explain unit testing to you. But, I am here to assert its flaws. Particularly in distributed systems. Firstly, unit tests are a pain in the ass to write. You are literally hardcoding inputs and outputs. What this also means is that you need to list, via examples, all of the ways in which your function could fail. The problem is, you might not think of example data that covers all failure modes.
Wouldn’t life be easier if we could stipulate the desired end states, or “properties”, of our function’s outputs, and have tests be automatically generated to verify them? If this sounds confusing, it really isn’t, so bear with me.
For example, one desired end state of our “sort_number( )” function could be that our output length should equal our input length. I.e. if we receive an array with three numbers, our function should output an array containing three numbers. Cool, we can generate a bunch of tests to make sure this is the case. Similarly, we want to ensure that our “sort_number( )” function’s output contains the exact same numbers that it receives as an input, just sorted of course.
This “property”-based approach to testing is called, shock horror, “Property-Based Testing”, or “PBT” if you wanna look cool in front of your eng homies. Like many of my favourite eng movements (e.g. Nix), PBT stems from the functional programming community via Haskell’s QuickCheck. The primary PBT library in Python is called Hypothesis. Let’s revisit the properties we said we’d like our “sort_number( )” function to have and implement them via Hypothesis:
```python
from hypothesis import given
from hypothesis.strategies import lists, integers

@given(lists(integers()))
def test_sort_properties(nums):
    result = sort_number(nums)
    # Same length
    assert len(result) == len(nums)
    # Same elements
    assert sorted(result) == sorted(nums)
```

Nice. Now Hypothesis automatically generates at runtime the input data (i.e. the example-based tests) that we previously had to hand-write. It’ll also bias toward edge cases (empty lists, single elements, negative numbers, large numbers, duplicates, etc.). Now I know testing isn’t necessarily the “hottest” category of developer tools, but c’mon, that is pretty sweet. Antithesis takes this PBT methodology but applies it to how your entire system should work. Don’t stress about this yet, we’ll return to it.
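To see why properties beat hand-picked examples, here’s a minimal sketch (the buggy_sort function and the crude random-input loop are my own illustration, not Hypothesis itself): a subtly buggy sort that deduplicates would sail past example tests written with distinct values, but violates the length property almost immediately.

```python
import random

# Hypothetical buggy sort: sorting via set() silently drops duplicates
def buggy_sort(numbers):
    return sorted(set(numbers))

# Hand-rolled property check over random inputs -- a crude version of
# what Hypothesis automates (plus edge-case bias and input shrinking)
random.seed(0)  # fixed seed so any failing input is reproducible
counterexample = None
for _ in range(1000):
    nums = [random.randint(-5, 5) for _ in range(random.randint(0, 6))]
    if len(buggy_sort(nums)) != len(nums):  # the "same length" property
        counterexample = nums
        break
```

Any list with a repeated value (e.g. [0, 0]) violates the property; Hypothesis would additionally shrink whatever it found down to a minimal failing example.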
The Challenge With Distributed Systems?
When you have a distributed system (multiple services, databases, network calls), bugs often depend on precise timing. This is known as a race condition. For example, a race condition might only trigger when server A’s request arrives exactly 3 milliseconds before server B’s, whilst the database is mid-write, during a network blip. You could literally run your test suite a thousand times and never hit this bug. Sure, us mere mortals can conjure simple unit tests, but something this complex? Tricky. This is why production is a different beast to development / staging environments.
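To make the timing-dependence concrete, here’s a toy model (entirely my own sketch, not anyone’s production code) of a lost-update race: two clients each read a shared balance, then write back a decremented value. The same operations produce different results depending purely on how the read/write steps interleave.

```python
def run(interleaving):
    # Shared balance starts at 100; clients "A" and "B" each withdraw 10
    state = {"balance": 100}
    local = {}
    for who, step in interleaving:
        if step == "read":
            local[who] = state["balance"]        # read shared state
        else:
            state["balance"] = local[who] - 10   # write back (possibly stale!)
    return state["balance"]

# Monday's timing: A finishes before B starts -> correct result (80)
ok = run([("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")])
# Tuesday's timing: both read before either writes -> lost update (90)
bad = run([("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")])
```

Run your test suite under the first interleaving a thousand times and the bug never shows; only the second, unlucky ordering exposes it.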
As Doug Patti at Jane Street puts it: “testing is still hard. It takes time to write good tests, and in any non-trivial system, your tests are an approximation at best. In the real world, programs are messy. The conditions programs run under are always changing: user behavior is unpredictable, the network blips, a hardware failure causes a host to reboot. It’s inherently chaotic. That’s the hard thing about developing HA systems: for all the careful tests that you think to write, there are some things you can only learn by experiencing that chaos. That’s what it takes to go from merely being tested to being battle-tested.”
Nice. Now, I should point out that “regression testing” (see something bad happen in production, then test for it in the future) has been a testing methodology for some time. But it’d be preferred if, y’know, the bad thing never happened in the first place? We also have some mitigations for distributed systems’ non-determinism, like locks/mutexes, atomic operations, and message passing, but they can create performance overhead and are still difficult to reason about in advance of deploying to production.
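As a quick sketch of the lock/mutex mitigation (my own illustration using Python’s stdlib threading): holding a lock across the read-modify-write makes each withdrawal atomic, at the cost of serializing the two requests.

```python
import threading

balance = 100
lock = threading.Lock()

def withdraw(amount):
    global balance
    with lock:              # only one thread may read+write at a time
        b = balance
        balance = b - amount

threads = [threading.Thread(target=withdraw, args=(10,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# balance is now reliably 80 -- but we've paid with contention overhead,
# and real systems have many such critical sections to reason about
```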
Antithesis - the Deterministic Computer
Antithesis takes being “battle-tested” up several notches from regression testing etc. Per Jane Street: “The amazing thing that Antithesis does is run your whole system in a virtual machine controlled by a completely deterministic hypervisor, and then adds a little manufactured chaos by interfering with scheduling and networking. It uses this setup to explore many different scenarios, and to discover circumstances where your system might fail.” Alright, now this is something for us to unpack 😄
I’m aware the above sounds complicated, but don’t fret. The virtual machine part is easy to grok: Antithesis clones your distributed system (e.g. your food delivery app) and runs it on their own servers (vs. your users’!). It does this to mimic your system running in production. It can then replicate this system n-many times, across n-many scenarios (we’ll get to this).
Most importantly, these virtual machines (“VMs”) are controlled by a deterministic hypervisor. This hypervisor ensures that your distributed system has a consistent set of properties. To make this more concrete, let’s revisit the race condition we described above. When not using Antithesis, our “normal” production system may work like so:
Monday 2pm - Everything works fine:
- Server A request: arrives at 14:00:00.100
- Server B request: arrives at 14:00:00.250 (150ms later)
- Database write: starts at 14:00:00.050, finishes at 14:00:00.120
- Network: stable
Result: ✓ Works perfectly
Tuesday 9am - Same exact user actions:
- Server A request: arrives at 09:00:00.100
- Server B request: arrives at 09:00:00.103 (3ms later - just unlucky timing!)
- Database write: starts at 09:00:00.098, still running at 09:00:00.103
- Network: small blip causes 5ms delay
Result: ✗ BUG! Data corruption!
Wednesday 3pm - Trying to reproduce:
- Server A request: arrives at 15:00:00.100
- Server B request: arrives at 15:00:00.180 (80ms later)
- Database write: starts at 15:00:00.095, finishes at 15:00:00.115
- Network: stable
Result: ✓ Works fine again
& herein lies the issue: you know there’s a bug, but oftentimes you don’t know what’s causing it. Was it our network blip? Was it the server request timing? Perhaps it was the presence of both? etc. Hence, the bug becomes difficult to reproduce. Our system is rather inconsistent. Gah!
Hopefully it’s now clear why we’d like for our system to have a guaranteed set of properties. If we can reproduce this 3ms latency and network blip consistently, and each time they’re present we note that our data corruption bug occurs, then we’ve likely found the problem. Nice.
By contrast, using Antithesis’ deterministic hypervisor, our system would run like the below. These are separate runs, but identical. Note - this is an oversimplification, but illustrative.
Monday 2pm - Running Scenario #1 in Antithesis:
- Server A request: arrives at T+0.100
- Server B request: arrives at T+0.250 (150ms later)
- Database write: starts at T+0.050, finishes at T+0.120
- Network: stable
Result: ✓ Works perfectly
Tuesday 3pm - Replaying Scenario #1 in Antithesis:
- Server A request: arrives at T+0.100 [IDENTICAL]
- Server B request: arrives at T+0.250 (150ms later) [IDENTICAL]
- Database write: starts at T+0.050, finishes at T+0.120 [IDENTICAL]
- Network: stable [IDENTICAL]
Result: ✓ Works perfectly [IDENTICAL]
Friday 10am - Replaying Scenario #1 in Antithesis:
- Server A request: arrives at T+0.100 [IDENTICAL]
- Server B request: arrives at T+0.250 (150ms later) [IDENTICAL]
- Database write: starts at T+0.050, finishes at T+0.120 [IDENTICAL]
- Network: stable [IDENTICAL]
Result: ✓ Works perfectly [IDENTICAL]

Nice. As Doug Patti said, based on the determinism that Antithesis gives us, we can “[use] this setup to explore many different scenarios, and to discover circumstances where your system might fail.” I.e. Antithesis will clone your distributed system across a number of VMs, and then autonomously start changing certain properties (e.g. 3ms latency could become 3.5ms) and purposefully inject issues like network blips (“fault injection”) in order to test your distributed system. Cool. Have you noticed that this autonomous testing is just property-based testing? The pieces are coming together.
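A toy version of this idea (my own sketch; Antithesis’ actual mechanism is the hypervisor, not a PRNG in your code) is to derive every “random” event — request latencies, network blips — from a seeded generator. A scenario is then fully identified by its seed: replaying a seed is bit-for-bit identical, and exploring new scenarios is just sweeping seeds.

```python
import random

def run_scenario(seed):
    rng = random.Random(seed)                    # all "chaos" flows from the seed
    a_arrival = rng.uniform(0, 200)              # server A request, ms after T+0
    b_arrival = a_arrival + rng.uniform(0, 200)  # server B arrives some time later
    blip = rng.random() < 0.1                    # inject a network blip in 10% of runs
    # hypothetical bug: fires only under a <5ms race combined with a blip
    bug = (b_arrival - a_arrival) < 5 and blip
    return (round(a_arrival, 3), round(b_arrival, 3), blip, bug)

# Replaying the same seed on Monday, Tuesday, or Friday gives identical runs
assert run_scenario(1) == run_scenario(1)
# "Exploring scenarios" = sweeping seeds; any failing seed is a
# perfectly reproducible bug report
failing_seeds = [s for s in range(1_000) if run_scenario(s)[3]]
```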
If your system fails, no worries, Antithesis will help you reproduce the bug and, ultimately, bash it. This readers, is “Deterministic Simulation Testing”.
How Does This Work? The Thesis Behind Antithesis
Ok, I think it’s important to take a breather and acknowledge that you should now get the value prop of Antithesis. That’s great. But, how on earth does Antithesis actually achieve this determinism that’s evidently so hard? Guys, it’s honestly so smart.
You see, infrastructure always comes with tradeoffs. The miracles that are modern computers (i.e. the servers running production code) are optimised for speed and throughput, not predictability. Whether it’s CPU schedulers, caching, multi-core parallelism, etc., they’re doing their darnedest to make your software fast. But, this means that if your system can run 8ms faster on Tuesday vs. Monday, it will. Hence, it’s non-deterministic.
Instead of trying to recreate production’s chaotic timing (based on actual milliseconds ticking by on a clock), Antithesis creates a controlled environment where time works fundamentally differently. Don’t worry, you’re not reading Dark Matter, this is simply cool comp sci.
Time flows in Antithesis not via milliseconds, but by CPU instruction count. Whether it’s 2:45pm on Tuesday or 5:15am on Monday is irrelevant, instruction number 5,000 is instruction number 5,000. Our new unit of “time” (or “simulated clock”) removes the variability inherent in production systems.
This is important, as software applications often make timing-based decisions (timeouts, rate limiting, cache expiration). If your code sees different time values on different runs, it will behave differently. Antithesis’ “virtual time” ensures that your code always sees the same values. This took me quite a bit to grok, so let me walk through an incredibly simplified example.
```python
start = time.now()  # gets us the actual time per our computer's clock
```

(Illustrative pseudocode: in real Python you’d call time.time( ), but the idea is the same.) When the above system call is made, Antithesis intercepts it and instead converts the instruction count into a virtual time value in milliseconds. I’m not sure how they actually calculate this conversion, but for simplicity’s sake, let’s pretend they divide the count by 100. In this case, instruction count 5,000 always outputs 50 milliseconds. It ultimately converts to milliseconds because our original code (the time.now( ) call) was written to process milliseconds, not instructions.
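Here’s a minimal sketch of that interception (my own illustration; the 100-instructions-per-millisecond conversion is a made-up number, not Antithesis’ real one):

```python
class VirtualClock:
    # Hypothetical conversion rate, purely for illustration
    INSTRUCTIONS_PER_MS = 100

    def __init__(self):
        self.instruction_count = 0

    def tick(self, n=1):
        self.instruction_count += n   # advances only as guest code executes

    def now_ms(self):
        # what an intercepted time call would return to the guest
        return self.instruction_count / self.INSTRUCTIONS_PER_MS

clock = VirtualClock()
clock.tick(5000)                      # 5,000 instructions have run
# clock.now_ms() is 50.0 on every run, whatever the wall clock says
```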
Nice. Now, remember the multi-core parallelism I mentioned above? Well, much like our friend “time”, it also creates non-determinism issues. Why? Well, once you have multiple cores each processing separate threads (i.e. CPU instructions), these threads can execute in essentially infinite different orders depending on which core happens to run faster at any given nanosecond. Of course, Antithesis has an answer here too.
Within Antithesis, each instance of the deterministic hypervisor runs on just one physical CPU core (vs., say, 48 cores). This is totally fine, as this isn’t our production environment, we don’t need the speed benefits of multi-core processing.
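You can see why one core helps with a toy cooperative scheduler (again, my own sketch): model each thread as a generator and run them round-robin. With a single executor and a fixed policy, every run interleaves the steps in exactly the same order.

```python
def task(name, steps):
    for s in range(steps):
        yield f"{name}:{s}"           # one "instruction" per yield

def round_robin(tasks):
    order = []
    queue = list(tasks)
    while queue:
        t = queue.pop(0)
        try:
            order.append(next(t))     # run one step of this task
            queue.append(t)           # then rotate to the next task
        except StopIteration:
            pass                      # task finished, drop it
    return order

run1 = round_robin([task("A", 2), task("B", 2)])
run2 = round_robin([task("A", 2), task("B", 2)])
# identical interleaving every time: ["A:0", "B:0", "A:1", "B:1"]
```

With 48 real cores, the hardware picks the interleaving for you, differently every run; with one core and a deterministic policy, the scheduler picks it, identically every run.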
Alright, for those folk interested, there’s a lot more Antithesis does (e.g. deterministic I/O) to assemble their “deterministic computer”, but, I think at this point, you get my drift. The reality with all great infra is that it’s not a single “silver bullet” idea that solves the underlying problem, but a series of micro-optimisations that compound.
Incredible work team Antithesis, and congrats on your Series A! time.now( ) to get back to work 😄