Well-Ironed: A Software Development Practice

The Virtue of Busy Waiting

Sleepy, flaky tests

We all sometimes come across the dreaded Process.sleep/1 function in Elixir test code. You might also encounter its older brother, :timer.sleep/1, but you surely know that these two functions are both after the same thing: they actively work to undermine your Red-Green-Refactor state of flow, and aim to destroy your test suite’s determinism.

We feel that even one call to Process.sleep() in a test suite is one too many, and we’d like to showcase an approach which we use to clean up “sleepy” tests. This approach will help you regain your flow, while removing non-determinism from your tests. But also: it will produce the pleasant side effect of forcing you to think more systematically about the properties of the system you’re testing.

Let’s solve the “sleepy test” problem.

Let’s just doze off for a few moments here…

The systems that we work with nowadays are de facto distributed systems, even if we don’t consciously design them that way. Caches, databases, job queues, metrics and log aggregators, third-party services, as well as our own client devices all comprise “our application”, and as such, require testing.

Many of these subsystems are asynchronous: writing a document to an eventually-consistent store doesn’t guarantee it will be there when we subsequently query for it; setting up a small job in our async pipeline doesn’t mean the results will be ready for us right away – even if the task itself is trivial.

And so it happens that when we write our tests, we do the laziest thing imaginable: rest and doze off a bit.

given_i_insert_a_document(db, %{"title": "test_doc", "content": "abc"})
Process.sleep(250) # wait for db to persist & replicate
x = when_i_query_by_title(db, "test_doc")
assert x == {:ok, %{"title": "test_doc", "content": "abc"}}

We fine-tune our sleep time to be “just enough” for the db to do the necessary paperwork, and we proceed with writing subsequent test cases, happy with our “simple” solution. This deceptively simple solution usually works. It works well enough for us to get productive and forget about why we needed it in the first place.

Things are peachy until we start noticing two worrying occurrences.

1) Sometimes our assertions fail with errors like: {:error, "title 'test_doc' not found in db"}

2) Our test suite is taking longer and longer to finish

The first problem comes from the fact that we can’t be certain that the database will indeed finish processing and replicating our data within the 250 milliseconds that we specified. Sometimes it might take 259ms for our test record to become available. If we’d allowed for 10 extra milliseconds of sleep in our test, it would have passed. Of course, we know that this is just a spurious timing issue, so we’ll re-run the failing test.

This fact represents “secret knowledge” about the system: it will cause problems on CI servers, overloaded machines (we’ve all tried running our test suites while videoconferencing) and it will cause problems when onboarding new developers. It also reduces our confidence in our test suite. A failing test should reliably indicate an issue with the system under test. If it doesn’t, its value as a test is greatly diminished.

The second problem is that as our test cases increase in number, the little moments of time that our computer spends sleeping begin to compound. make test starts taking longer and longer, leading developers to find shortcuts, run tests less often, or write tests after writing the code. All worrying signs. Ideally, writing a test should be easier and more fun than writing the implementation, to incentivize the developers. Tests that take ages to run are not fun, and only incentivize us not to run them.

Fortunately, the solution to both of these problems is easy and fun.

Introducing: Polling

If you’re a parent and own a car, you’re probably familiar with the staple of family road trips: “Are we there yet?”. If you’re not, you might recall posing the question yourself, in your youth, to the great irritation of your parents. Luckily, the computer won’t get annoyed with us asking about the same thing over and over again. Let’s take advantage of this fact, and write some code to do the asking for us.

defmodule Wait do
  # Invoke the predicate repeatedly until it returns a truthy value.
  def until(p) do
    p.() || until(p)
  end
end

We can now call Wait.until with a predicate function p, and if this function returns a falsy value (false or nil), we’ll happily invoke it again, repeatedly, until we get the answer we’re looking for. “ARE WE THEEEERE YET??”

With this pattern, we can continue asking the question for as long as it takes. For example, here’s how we could wait until the virtual machine generates a high-enough unique integer:

Wait.until(fn ->
  IO.inspect(System.unique_integer([:positive])) > 100
end)

This pattern is called busy-waiting, because instead of sleeping quietly and conserving resources, we keep the computer busy with repeated calls to our predicate function. In most production systems, this is an antipattern you want to avoid due to its extreme wastefulness.

In test code, however, we shouldn’t be concerned with implementation efficiency, but with correctness, determinism, and turnaround time. Busy-waiting helps immensely with all of these, as we hope to demonstrate below.
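To see how little code such a combinator needs, here’s a minimal sketch of our own (the BusyWait module and its argument names are ours for illustration – this is not the Liveness API, though we mirror its defaults of 250 attempts every 20 milliseconds). It retries on both falsy return values and raised exceptions, and gives up after a bounded number of attempts. Yes, there is a short sleep between polls – the point is that the total wait adapts to the system instead of being hard-coded, and is bounded by attempts × interval:

```elixir
defmodule BusyWait do
  # Minimal sketch of an "eventually"-style combinator (illustrative, not
  # the Liveness API): retry `fun` up to `attempts` times, pausing
  # `interval_ms` between polls. A falsy return value or a raised exception
  # counts as a failed attempt.
  def eventually(fun, attempts \\ 250, interval_ms \\ 20)

  def eventually(_fun, 0, _interval_ms) do
    raise "condition never became true"
  end

  def eventually(fun, attempts, interval_ms) do
    result =
      try do
        fun.()
      rescue
        _error -> false
      end

    if result do
      result
    else
      Process.sleep(interval_ms)
      eventually(fun, attempts - 1, interval_ms)
    end
  end
end

# Example: wait for a flag that a background process flips after some work.
{:ok, flag} = Agent.start(fn -> false end)

spawn(fn ->
  Process.sleep(30)                      # simulate asynchronous work
  Agent.update(flag, fn _ -> true end)
end)

BusyWait.eventually(fn -> Agent.get(flag, & &1) end)  # returns true once the flag flips
```

Note that when the predicate succeeds, the combinator returns the predicate’s (truthy) value, so it composes naturally with assertions.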

Here are some patterns where we apply our Liveness library to make our tests more reliable and faster, while eliminating hand-written sleep statements.

Example 1: Ensuring resource availability

Let’s say we’re testing an external resource that’s represented by a GenServer. This resource is a singleton process, registered with the system under a specific name.

defmodule ExternalAdder do
  use GenServer
  require Logger

  def start_link do
    GenServer.start_link(__MODULE__, [], name: __MODULE__)
  end

  def add(x,y), do: GenServer.call(__MODULE__, {:add, x, y})

  def init([]) do
    Process.flag(:trap_exit, true)
    {:ok, :no_state}
  end

  def handle_call({:add, x, y}, _, state) do
    {:reply, x + y, state}
  end

  def terminate(_, _) do
    Logger.info("closing at #{DateTime.utc_now()}")
    :ok
  end
end

Below, we have a test suite…

Code.require_file("external_adder.exs")
ExUnit.start()

defmodule ExternalAdderTestNoSleep do
  use ExUnit.Case

  test "addition identity" do
    {:ok, _} = ExternalAdder.start_link()
    assert ExternalAdder.add(1,0) == 1
  end

  test "addition commutativity" do
    {:ok, _} = ExternalAdder.start_link()
    assert ExternalAdder.add(1,2) == 3
    assert ExternalAdder.add(2,1) == 3
  end
end

…that will often fail with the following message:

$ elixir ./external_adder_test_no_sleep.exs

  1) test addition identity (ExternalAdderTestNoSleep)
     external_adder_test_no_sleep.exs:12
     ** (MatchError) no match of right hand side value: {:error, {:already_started, #PID<0.115.0>}}
     code: {:ok, _} = ExternalAdder.start_link()
     stacktrace:
       external_adder_test_no_sleep.exs:13: (test)



Finished in 0.1 seconds (0.1s on load, 0.00s on tests)
2 tests, 1 failure

Why is this suite flaky?

When the first test finishes, it sends an exit signal to the GenServer process registered under the name ExternalAdder. However, this signal is sent asynchronously, and the ExternalAdder process gets to execute its terminate callback function. In our case, this involves logging a diagnostic message. As it turns out, it often happens that the terminate callback is still executing when the second test case is started.

The call to ExternalAdder.start_link() now fails, because the GenServer registered in the previous test case is still in the process of terminating. The BEAM runtime will not let us start another process under the same name until the previous one has been unregistered.
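We can watch this race directly by polling the process registry ourselves. The helper below is purely illustrative (the module and the :registry_demo name are our own, not part of the test suite): it spins until a registered name disappears, which only happens once the process behind it has fully terminated.

```elixir
# Illustrative helper (module and names are ours): spin until `name` is no
# longer registered in the local process registry.
defmodule NameWatcher do
  def await_unregistered(name) do
    case Process.whereis(name) do
      nil -> :ok
      _pid -> await_unregistered(name)
    end
  end
end

# A process disappears from the registry only when it has fully terminated,
# which happens asynchronously relative to the caller:
{:ok, pid} = Agent.start(fn -> 0 end, name: :registry_demo)
Process.exit(pid, :kill)
:ok = NameWatcher.await_unregistered(:registry_demo)
```

Once await_unregistered returns, a start_link under the same name is guaranteed to succeed – exactly the property our sleepy tests are only hoping for.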

We know for a fact that a new ExternalAdder should start up at some point, because the old one was linked to the previous test case process, which is done executing. We are running into a race condition which we could simply wait out. Let’s take this approach and see if it works:

Code.require_file("external_adder.exs")
ExUnit.start()

defmodule ExternalAdderTestSleep do
  use ExUnit.Case

  test "addition identity" do
    Process.sleep(10)
    {:ok, _} = ExternalAdder.start_link()
    assert ExternalAdder.add(1,0) == 1
  end

  test "addition commutativity" do
    Process.sleep(10)
    {:ok, _} = ExternalAdder.start_link()
    assert ExternalAdder.add(1,2) == 3
    assert ExternalAdder.add(2,1) == 3
  end
end

Indeed! The barely noticeable sleep of 10 milliseconds at the start of each test case gives the terminating ExternalAdder enough time to print out its diagnostic message and unregister. Now the entire test suite passes.

All is fine and dandy… or is it? Let’s simulate several days’ worth of development and testing by running our test suite in a loop:

$ while (elixir ./external_adder_test_sleep.exs); do sleep 1; done

This will run the tests until they fail. And sure enough, after several tries, we hit the error from before:

.
11:38:25.281 [info]  closing at 2019-10-22 11:38:25.266176Z


  1) test addition identity (ExternalAdderTestSleep)
     external_adder_test_sleep.exs:7
     ** (MatchError) no match of right hand side value: {:error, {:already_started, #PID<0.116.0>}}
     code: {:ok, _} = ExternalAdder.start_link()
     stacktrace:
       external_adder_test_sleep.exs:9: (test)



Finished in 0.06 seconds (0.03s on load, 0.03s on tests)
2 tests, 1 failure

If you’ve used Process.sleep to wait out race conditions in your tests before, you’ll know what this means. From time to time, the shutdown process of our GenServer happens to take longer than 10ms. Let’s give it a more generous allowance of 50ms.

Code.require_file("external_adder.exs")
ExUnit.start()
defmodule ExternalAdderTestSleep50ms do
  use ExUnit.Case

  test "addition identity" do
    Process.sleep(50)
    {:ok, _} = ExternalAdder.start_link()
    assert ExternalAdder.add(1,0) == 1
  end

  test "addition commutativity" do
    Process.sleep(50)
    {:ok, _} = ExternalAdder.start_link()
    assert ExternalAdder.add(1,2) == 3
    assert ExternalAdder.add(2,1) == 3
  end
end

Now, this test suite seems to work every time – we can leave it running in a loop and it keeps passing. If this were just one isolated test case, it would be an OK solution, but it doesn’t give us great confidence in the test passing on another developer’s overloaded machine, or on a CI server.

Given enough time, this test suite will fail.

Let’s see how we can use a busy loop to remove the sleeps from our test and gain confidence in its reproducibility.

Imagine if we could say: “I expect this expression to eventually run successfully, even if it fails a couple of times first”. Here’s how we would use such a construct to state that ExternalAdder.start_link() should eventually succeed:

eventually(fn -> {:ok, _pid} = ExternalAdder.start_link() end)

It turns out we can do just that! The Liveness library provides our higher-order eventually function, which will take its input and try executing it multiple times, until it succeeds.
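To make “until it succeeds” concrete, here’s a small illustration of the retry behaviour (the Agent counter and the threshold are our own invention; it assumes the Liveness dependency is available and Eventually imported, as in the listings in this article). The anonymous function raises on its first two calls – just as the {:ok, _} match raises a MatchError while the old process lingers – and eventually keeps retrying until the third call returns a truthy value:

```elixir
import Eventually

# An Agent serves as a call counter (names here are illustrative).
{:ok, counter} = Agent.start(fn -> 0 end)

flaky = fn ->
  n = Agent.get_and_update(counter, fn n -> {n + 1, n + 1} end)
  # The first two calls raise; the third succeeds.
  if n < 3, do: raise("not ready yet"), else: n
end

eventually(flaky)  # keeps retrying; succeeds on the third call
```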

Code.require_file("external_adder.exs")
ExUnit.start()
defmodule ExternalAdderTestEventually do
  use ExUnit.Case
  import Eventually

  test "addition identity" do
    eventually(fn -> {:ok, _} = ExternalAdder.start_link() end)
    assert ExternalAdder.add(1,0) == 1
  end

  test "addition commutativity" do
    eventually(fn -> {:ok, _} = ExternalAdder.start_link() end)
    assert ExternalAdder.add(1,2) == 3
    assert ExternalAdder.add(2,1) == 3
  end
end

Now, we can have confidence that our tests will pass in all kinds of situations, including on overloaded dev machines or CI servers.

Example 2: Too many sleeping tests

Let’s suppose that our development team is happy with the “good enough” solution of sleeping to wait out race conditions, and they’ve all agreed that a test suite failing every now and again is an acceptable price to pay.

Even if we’re happy with flaky and nondeterministic tests (which we shouldn’t be), there is another danger lurking here.

Here’s what happens when the number of “sleepy” test cases increases in our codebase.

Code.require_file("external_adder.exs")
ExUnit.start()
defmodule ExternalAdderTest100CasesNoSleep do
  use ExUnit.Case

  for i <- 1..100 do
    @tag number: i
    test "addition identity for #{i}", %{number: n} do
      Process.sleep(50)
      {:ok, _} = ExternalAdder.start_link()
      assert ExternalAdder.add(n,0) == n
    end
  end
end
Finished in 5.3 seconds (0.2s on load, 5.1s on tests)
100 tests, 0 failures

It works, but 100 simple tests took over 5 seconds to complete! And remember, we still don’t have any guarantees that at some point, some of these tests won’t randomly fail, for instance due to CPU time stolen by the hypervisor.

Let’s see if the situation would improve if we used eventually instead:

Code.require_file("external_adder.exs")
ExUnit.start()
defmodule ExternalAdderTest100CasesEventually do
  use ExUnit.Case
  import Eventually

  for i <- 1..100 do
    @tag number: i
    test "addition identity for #{i}", %{number: n} do
      eventually(fn -> {:ok, _} = ExternalAdder.start_link() end)
      assert ExternalAdder.add(n,0) == n
    end
  end
end
Finished in 0.5 seconds (0.2s on load, 0.2s on tests)
100 tests, 0 failures

Much better. Remember that here we’re using the eventually default of 250 attempts, every 20 milliseconds.

To gain more speed, we can try to decrease the polling interval and increase the number of attempts.

Code.require_file("external_adder.exs")
ExUnit.start()
defmodule ExternalAdderTest100CasesEventuallyShortInterval do
  use ExUnit.Case
  import Eventually

  for i <- 1..100 do
    @tag number: i
    test "addition identity for #{i}", %{number: n} do
      eventually(fn -> {:ok, _} = ExternalAdder.start_link() end, 1_000, 5)
      assert ExternalAdder.add(n,0) == n
    end
  end
end
Finished in 0.4 seconds (0.3s on load, 0.1s on tests)
100 tests, 0 failures

An extra 20% faster! We’ve gone from a rather-untrustworthy, slow test suite to a deterministic, fast test suite.

Example 3: Assertion on eventually consistent results (e.g. db/queue reads)

Test assertions are where the eventually pattern really shines. Very often we come across situations where we’re testing an asynchronous or eventually consistent system.

As part of the test, we perform some actions, and then we check that these actions had the intended effect. For example, we trigger an asynchronous job via an API request, and we expect the job result to land in an S3 bucket.

This type of test often involves excessive sleep statements, put in place by different developers working on different machines. As with the previous examples, these tests are often flaky as well – there’s no guarantee that sleeping for N seconds will be enough for our system to process the data and upload results to a remote file server.

On top of that, the tests take longer and longer the more cases we need to support.

To demonstrate this, let’s build a very simple asynchronous system using built-in OTP functionality. The system accepts two numbers that we wish to add, and it eventually writes out the result of their addition to a new file. The caller is responsible for reading the result from the file.

defmodule AdditionJob do
  use GenServer
  @flush_timeout 500 # Change me to >600 and see the sleep_test fail!

  def start_link(), do: GenServer.start_link(__MODULE__, [])

  def schedule_addition(pid, x, y), do: GenServer.call(pid, {:add, x, y})

  def init([]), do: {:ok, :no_state}

  def handle_call({:add, x, y}, _, state) do
    {output_filename, 0} = System.cmd("mktemp", [])
    output_filename = String.trim(output_filename)
    handle = File.open!(output_filename,
      [:write, {:delayed_write, 64, @flush_timeout}])
    IO.binwrite(handle, "#{x + y}")
    {:reply, {:ok, output_filename}, state}
  end
end

Eventual writes are simulated using the {:delayed_write, 64, @flush_timeout} option passed to File.open!. The Erlang VM will buffer writes until at least 64 bytes are buffered, or until the oldest buffered data is @flush_timeout milliseconds old, and only then flush them to disk. Our system will use a @flush_timeout of 500 ms.
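We can see this buffering in action on its own, outside of AdditionJob (the file name and the 64-byte/500 ms parameters below are illustrative):

```elixir
# Sketch of the delayed-write buffering described above.
path = Path.join(System.tmp_dir!(), "delayed_write_demo.txt")
handle = File.open!(path, [:write, {:delayed_write, 64, 500}])
IO.binwrite(handle, "3")

# Immediately after the write, the byte may still sit in the VM's buffer,
# so a concurrent reader can observe an empty file at this point.
File.read(path)

# Closing the file (like exceeding the 64-byte or 500 ms thresholds) flushes
# the buffer, after which the write is visible to every reader:
File.close(handle)
File.read(path)  # {:ok, "3"}
```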

Here’s how we’d test this system with a sleep statement:

Code.require_file("addition_job.exs")
ExUnit.start()

defmodule AdditionJobSleepTest do
  use ExUnit.Case

  test "addition result is eventually available in output file" do
    # given
    {:ok, task} = AdditionJob.start_link()

    # when
    {:ok, output_file} = AdditionJob.schedule_addition(task, 1, 2)

    # then
    Process.sleep(600)
    assert File.read(output_file) == {:ok, "3"}
  end
end

When we run it, we can see that it indeed works and reads the correct result from a file, and takes 0.6 seconds to run. Let’s see what it would look like if we used Liveness.eventually.

Code.require_file("addition_job.exs")
ExUnit.start()

defmodule AdditionJobEventuallyTest do
  use ExUnit.Case
  import Eventually

  test "addition result is eventually available in output file" do
    # given
    {:ok, task} = AdditionJob.start_link()

    # when
    {:ok, output_file} = AdditionJob.schedule_addition(task, 1, 2)

    # then
    assert eventually(fn -> File.read(output_file) == {:ok, "3"} end)
  end
end

We’ve removed the sleep statement and replaced it with a more deterministic (and descriptive!) assertion:

assert eventually(fn -> File.read(output_file) == {:ok, "3"} end)

but apart from this, we don’t seem to have gained much. With both this test assertion and the previous one, we need to wait at least 500 milliseconds for the results to show up in our output file.

However, the benefits of using an eventual assertion will become apparent to us further down the road. Here’s one such situation.

Suppose the specification of our AdditionJob changes, and the eventual write now takes a bit longer to complete. Let’s say we have to bump the @flush_timeout to 750 ms. If we do this, all our test cases that depend on a hard-coded sleep of 600 ms will fail, and we’ll have to go through them and change them.

Meanwhile, our eventual assertion keeps on working. The assertion says that eventually the results will be in the file, and that fact about our system has not changed, despite changes in its internal parameters.

Liveness.eventually asserts Liveness Properties

The correctness of programs can be checked using a conjunction of safety and liveness properties, as proposed by Leslie Lamport.

Informally, the safety properties of a system define states that the system is never allowed to end up in. Liveness properties, on the other hand, say that even if the system is not in a given state yet, it will eventually reach that state at some point in the future.

In the language of the test cases we write to accompany our software, every time we use assert, we are verifying safety properties, because we have some expectation about the state of the system in the given moment. More explicitly, we say: “this is the permissible state of the system now. All other states mean that this safety property has been violated and the test must fail.”

When we compose assert with eventually, we are declaring something like the following: “the system may be in any state at the moment, and that’s OK. But we expect the system to reach this desired state at some point in the future.” This brings us into the domain of liveness properties.

Of course, the eventually combinator will only keep retrying for so long, so it’s possible that we won’t witness our system reaching the desired state before our time runs out. This is a false negative, and a known fact about practically checking liveness properties: you can’t know for sure that the system can reach a desired state until you actually see the system in that particular state. Model-based & formal methods offer other ways of verifying liveness properties, but these are far outside the scope of ExUnit.

The #sleeplesstesting challenge

If you’re sick of flaky and sleepy tests in your Elixir projects, how about taking the #sleeplesstesting challenge? We dare you to grep your test directories for occurrences of the word sleep, and see what happens when you replace sleeps with declarative busy waits.

Ping us on Twitter and tweet your results with the hashtag #sleeplesstesting!