Let’s talk about invisible tech debt! Vol. 1: Expressive tests and test error messages

In my previous article, I discussed tech debt in somewhat general terms and why we need to talk about it more in multidisciplinary teams. And in particular, I talked about one subset of tech debt which is particularly hard to identify: the invisible tech debt that acts as a gradual drag on your teams’ productivity but never makes it onto work tickets.

I ended my previous article by concluding that you should simply ask your developers where the pain is and work out how to help them fix that pain, and I offered some specific questions to ask.

In this article I’ll drill down further, into one very specific case of this sort of invisible tech debt, including a handful of examples from my last few projects where I was able to spend some time fixing the issues causing it.

Wanting “good tests” is something everyone agrees on

You may remember my previous caricatures of devs and clients wanting different things from the codebase. But you may also have noticed that “good tests” are one of the few things that everyone agrees on, albeit to different extents and for different reasons. That makes tests a great place to consider expanding the scope of the tech debt you choose to address.

One specific case of “tech debt in tests” I particularly care about is expressive tests and test error messages (or rather, their absence).

Test code should make it clear what the test expects to be true and why

That’s not a very contentious statement, so I won’t spend much time on it. But there are two important corollaries which are more often overlooked:

The error messages reported when tests fail should make it clear what went wrong, what was expected, and why the actual output was different

If a test fails due to an unforeseen exception in the system code, then the exception details that make it back to the top layer of the test should be, or include, details of the original exception that caused the problem

When put like this, my statements might all seem pretty obvious 😄, but it’s remarkably easy for small edge cases to creep into a test suite where they stop being true.

Inexpressive error messages are invisible tech debt

Why are inexpressive error messages a form of invisible tech debt? Because when the principles above stop being true… … …nothing breaks.

  • The tests don’t fail (except when they should, due to some other underlying bug in new code)
  • Tickets aren’t badly blocked, because the devs can still dig down into the error to find out where the real error was
  • The users don’t see any issues, because the bugs that triggered these unclear messages still get fixed before being deployed
  • The production logs don’t even fill up with error messages, because this is only in the testing code and doesn’t get deployed

That’s why this problem is “invisible”. But why is it actually a problem? Is this really tech debt at all?

Well, the problem was hidden in the middle there:

> devs can dig down into the error to find out where the real error was

That’s extra time your developers spend trying to find out what broke, which they shouldn’t have to spend, and which is slowing them down from delivering their feature. The information they needed should have been provided front-and-centre, not “available if you spend the time to excavate it”.

That’s why it’s tech debt! And specifically, the flavour of invisible tech debt I’m talking about today.

I’ll briefly talk about where these sorts of problems come from, and cover some easy advice on how to set up your tests from the start to minimise the chances of this sort of tech debt developing. Finally, I’ll share three stories of recent problems I’ve had with inexpressive tests in the wild. Sometimes I’m able to quantify the impact to our project and client, but other times not, which should arguably be even more worrying – unknown unknowns aren’t safe because you’ve no reason to think they’re a problem; they’re dangerous because you can’t tell how big a problem they might be.

How to write expressive tests

I promised to touch briefly on this, so I will. Very roughly, it should be clear to devs what a particular test is doing when they read its code. To achieve this, you want to focus on the following (there’s a small sketch after the list showing these points in practice):

  • Specific, descriptive test names
  • Clear separation of test phases
    • “Arrange; Act; Assert” is a common mnemonic
  • Clarity & consistency on the relevant parts of the setup for any given test
    • We’ve written and open-sourced a dedicated .NET library for this: LochNessBuilder
  • Readability of the assertions after the code has run
  • A record of why the expected output is expected
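
To make those points concrete, here is a minimal sketch of what they look like in a single test. I’m assuming NUnit and FluentAssertions here, and the Customer and DiscountCalculator types are invented purely for illustration:

```csharp
using FluentAssertions;
using NUnit.Framework;

[TestFixture]
public class DiscountCalculatorTests
{
    [Test]  // A specific, descriptive name: the behaviour plus the expected outcome.
    public void CustomersWithMoreThanFiveOrders_GetTheLoyaltyDiscount()
    {
        // Arrange: only the details that matter to this particular test.
        var customer = new Customer { CompletedOrders = 6 };
        var calculator = new DiscountCalculator();

        // Act: one obvious invocation of the behaviour under test.
        var discount = calculator.CalculateDiscount(customer);

        // Assert: the expectation, and a record of why it is expected.
        discount.Should().Be(0.10m,
            because: "customers with more than 5 completed orders qualify for the loyalty discount");
    }
}
```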

Test error messages should be expressive too

A thorough automated integration test suite allows cross-cutting refactors to be made confidently and quickly. But for your developers to use your test suite efficiently, it’s critical that your test suite always gives clear, expressive error messages.

For unit test suites, this normally happens automatically: the narrow scope and isolation encourage it. But for integration test suites, things can be a little less straightforward, for two broad reasons:

  • Overly simplistic assertions
  • Exception details getting swallowed or obfuscated

I’ll outline these below and give three examples for the second.

Common problems causing inexpressive test error messages

Overly simplistic assertions cause invisible tech debt

The happier case of a test failure is that the code you were testing ran to completion but gave the wrong output.

Every language has its standard test library/framework, which provides the basic infrastructure and execution platform for your suite of tests. That framework will certainly come with assertion methods to check the values of all the primitive types in your language, for example: “Does this number equal 27?”, “Is this expression true?”, and “Is this string the same as that string?”. Using those basic assertions will get the job done, but they aren’t very expressive – they’ll only tell you “Yes or No: was this value exactly what I expected?”.

A more powerful assertion library will offer much more detail on what the output actually was, and give easy access to more subtle assertions on objects like collections, dictionaries, and dates – the sorts of outputs we normally get when writing integration tests.

For concrete examples, in .NET I exclusively use FluentAssertions as my assertion library. It provides a huge suite of assertion methods such as:

  • testNumericOutput.Should().BeInRange(min, max)
  • testListOutput.Should().NotContainNulls()
  • testListOutput.Should().AllSatisfy(someCondition)
  • testStringOutput.Should().StartWith(thisWord)
  • testStringOutput.Should().MatchRegex(expectedPattern)

Every one of those could be written out as code that calculates the conclusion of the logical check and then asserts that the result is true, for example Assert(output < min && output > max) rather than .BeInRange(). Each of the helpers makes the test much more readable than the written-out version, as well as being quicker to write and less prone to errors (did you even notice that the signs were the wrong way round in that Assert?). But much more importantly, each of those methods carries a much more detailed error message formatter – if those assertion methods fail, they express much more detail about what the value was and what value was expected, not just “the test was False when it should have been True”:

  • The numeric assertion reports the actual number found, as well as both of the boundaries
  • The Should().NotContainNulls() collection assertion reports how many nulls there were and where they sat in the collection, rather than just declaring the abstract presence of at least one unexpected null
  • The string assertions report what value was returned, and where it differed from the expected value
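
To make the contrast concrete, here is a rough sketch of the same range check written both ways. The quoted failure messages are indicative only; the exact wording varies between library versions:

```csharp
using FluentAssertions;
using NUnit.Framework;

[TestFixture]
public class RangeAssertionComparison
{
    private const int min = 5;
    private const int max = 10;
    private readonly int testNumericOutput = 27;   // stand-in for a real computed output

    [Test]
    public void WrittenOutVersion()
    {
        // On failure, this only tells you something like "Expected: True. But was: False."
        Assert.IsTrue(testNumericOutput >= min && testNumericOutput <= max);
    }

    [Test]
    public void ExpressiveVersion()
    {
        // On failure, this reports the actual value and both boundaries, something like
        // "Expected testNumericOutput to be between 5 and 10, but found 27."
        testNumericOutput.Should().BeInRange(min, max);
    }
}
```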

These are what I call “Expressive” assertions. None of them is a ground-breaking change from the manual alternative, nor is their absence a catastrophic problem…it’s invisible tech debt. The developer could find the actual values by running the test locally in the debugger[1]. The developer could write manual logs of all the values. But both options add a small extra cost to development time, every time.

Using an assertion library ensures that the full detail is always available, for free, and puts it in front of the developer as soon as the test completes. In your CI tooling they get that information without even having checked out the code locally. It won’t save a lot of time in any one case – but it’ll save a little bit of time every time a test fails!

It also improves the expressiveness of the test code in the first place. Double Win!

Exception details being swallowed or obfuscated causes invisible tech debt

We talked about the happy case, in which a test fails when it gets to the assertions. But the other common form of test failure can often be much more problematic: the case where the test never gets to the assertions – where it fails due to an exception thrown somewhere in the module that you’re invoking for your test.

A good, expressive test error message in that situation is one that still contains the details of the original exception – the first thing that went wrong. Ideally it would also contain details of any layers of error handling that attempted to deal with this; but having the originating error is the key detail.

Integration test suites, however, are moderately prone to losing that information for one reason or another. It’s much more common for tests to be failing deep in the centre of the code they’re testing – a long way from the original inputs, and well before the final outputs have been generated.

That means the error passes through many layers of infrastructural code before it bubbles out to the tests: plenty of layers in which the relevant details could get lost or discarded. Perhaps they were deliberately discarded in favour of “cleaner” user-friendly errors, or perhaps the original error caused some cascade of downstream errors and the test is only getting told about the last of those, without telling you anything about the original cause.
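
As a tiny illustration of how that loss happens (all the names here are invented for the example), compare the two ways of handling an error deep inside the code under test:

```csharp
using System;

public class ImportFailedException : Exception
{
    public ImportFailedException(string message, Exception inner = null)
        : base(message, inner) { }
}

public static class Importer
{
    public static void Run(Action writeToDatabase)
    {
        try
        {
            writeToDatabase();
        }
        catch (Exception ex)
        {
            // Loses the cause: a "cleaner" exception is thrown and the test
            // only ever sees "Import failed", with no hint of the real error.
            //   throw new ImportFailedException("Import failed");

            // Preserves the cause: the original exception (and its stack trace)
            // travels up as InnerException and appears in the test report.
            throw new ImportFailedException("Import failed", ex);
        }
    }
}
```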

Often these problems hinge on infrastructural setup of integration tests, especially where the code starts interacting with other systems rather than being pure business logic; for example, interactions with databases or APIs, test scenario setup code, or authentication filters.

When your test suite has problems like these, your developers will repeatedly have this experience:

“I made a bunch of changes and then ran the tests. Something broke, but the specific error message I’m getting isn’t about what really broke – it’s about some secondary fallout, and the original error is lost…so I don’t know what part of my change caused the problem, or even what the direct problem actually is!”

There are countless specific ways that this can come about, and each one requires a different solution, but all of them cause the same end result: invisible tech debt.

Below I describe three examples of this pattern of inexpressive and obfuscatory error handling, covering:

  • DB access transaction boundaries
  • OWIN pipeline request handling
  • API response detail capturing

As you will see, (and as I foreshadowed in the previous article 😁), the nature of the right fix varies widely with the specific cause of each problem. Sometimes the fix is a smallish change to a handful of classes, sometimes a helper method to be reused everywhere, and sometimes, yes, it’s “spend a bunch of time as a one-off to fix a deep underlying issue that’s causing particular pain”.

Examples of fixing inexpressive test error messages

Pre-empting TransactionScope creation failures

A system I worked on leant heavily on TransactionScopes, a .NET utility for managing atomicity in DB operations. TransactionScopes help solve the challenge of atomicity in a complex codebase with lots of shared code being reused in different invocation paths, and at different degrees of abstraction. Rather than explicitly declaring the starts and ends of Transactions – requiring attention to the implications of nested Transactions being created or committed early – you use TransactionScope to express a desire for code blocks to be managed atomically, and the TransactionScope library handles the creation and committing of new Transactions.

TransactionScopes are great, but they have one flaw: if you try to create a new TransactionScope after the current Transaction throws an error, then the creation attempt itself throws an error. That’s not inherently unreasonable, especially in live code, but in an integration test it causes the exact problem I’ve described above: the original SQL exception gets masked, and we lose the detail of the actual problem.

Checking across the team, we found that in the previous month most devs had spent at least an hour trying to locate the source of an error hidden in this way, and an unlucky few had spent significantly more!

I added a custom wrapper around the native .NET TransactionScope class, which tracked exceptions thrown in earlier scopes and then reported those details if an attempt to create a new TransactionScope detected that a previous scope had aborted.
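
The wrapper itself wasn’t complicated. Here is a minimal sketch of the idea rather than the production code: the class name, the thread-static field, and the Execute helper are my inventions for illustration, and the real version had considerably more plumbing:

```csharp
using System;
using System.Transactions;

// A thin wrapper used in place of raw TransactionScope creation. It remembers
// the first exception thrown inside a scope so that, if a later attempt to
// create a scope fails because the ambient transaction has already aborted,
// we can report the original failure rather than the secondary one.
public static class TrackedTransactionScope
{
    [ThreadStatic] private static Exception _firstFailure;

    public static void Execute(Action body)
    {
        TransactionScope scope;
        try
        {
            scope = new TransactionScope();
        }
        catch (TransactionException) when (_firstFailure != null)
        {
            throw new InvalidOperationException(
                "Could not create a new TransactionScope because an earlier scope aborted. " +
                "The original failure is attached as InnerException.",
                _firstFailure);
        }

        using (scope)
        {
            try
            {
                body();
                scope.Complete();
            }
            catch (Exception ex)
            {
                _firstFailure ??= ex;   // remember the first real problem on this thread
                throw;
            }
        }
    }
}
```

Keeping the “first failure” in a thread-static field is the simplest way to share it between otherwise unrelated scope creations on the same test thread; a real version would also need to clear it between tests.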

The work took around 4 dev-days in a 10-to-12-dev team, meaning it recouped its costs within 3 months…on a system that has already been used and continuously enhanced for years, and will continue to exist for many more.

Post-processing requests that have already been disposed

Middleware pipelines make it easy to wrap request handlers around every request. This is extremely valuable for cross-cutting concerns like global error-handling, authentication, rate limiting, and request logging. In .NET, OWIN provides a rich interface for defining middleware components, allowing complex pre- and post-processing of requests.

This time, post-processing was causing our issues. Naturally, when logging a request, you want to record a summary of the response that was sent back to the caller. But equally, you don’t want to force that caller to wait for this logging if it’s the only thing left to be done – you want to capture the info from the response, send the response back, and then deal with logging the details, after the user has received their data.

In our integration tests, we spun up a test instance of the web server and sent it requests, before checking the response and the state of the DB.

But in the event that the server code failed, or returned any sort of non-success response code, we would check the code and then immediately discard both the request and response, since we no longer needed them. That, combined with the fact that the same process was running both the test (sending the request) and the webserver (receiving it), caused a rare failure mode in our request logger – one of the OWIN middleware handlers we’d written. When writing post-processing logs, our handler attempted to access details of the original request, which sometimes – depending on the precise timing of .NET garbage collection passes – no longer existed!

In these “bad” cases, the failure we would see was the post-processing complaining that it couldn’t find the request object anymore, rather than the output of the request-processing that it was providing logging about.

Worse, the way that the webserver was spun up in memory meant that we didn’t even get a clear error tying the failure to a particular part of the middleware pipeline – just a generic “the webserver has terminated unexpectedly” error. And all of this was completely unrelated to whether the code called by the test had done what we expected!

None of this ever happened in real use of the webserver: no “real” caller ever discarded the request so aggressively. And even if they did so, this wouldn’t have caused any practical problems – just some error records in logs, created after the request had been processed and the consumer had happily gone on their way.

But for our tests it was a huge issue, since it completely discarded the details of whatever real problems had occurred – problems which did need to be investigated. Even worse, very occasionally the race condition between ‘the logging using the request’ and ‘the test assertions using and discarding the request’ could strike even when the request had succeeded. “Flaky” tests are a huge problem in their own right, but in this situation they compounded an already damaging situation.

So I fixed the outer problem with starting the webserver, to isolate the cause of any failure to a particular part of the middleware. I then rewrote the middleware code to explicitly capture, in memory, all the parts of the request that post-processing would depend on, before starting to return the response. Now the post-processing handlers don’t refer to the original request object at all – only to their separately stored copy of the relevant information – and they don’t care what the caller does with the returned response.
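
To show the shape of that fix, here is a rough sketch assuming the Microsoft.Owin (Katana) OwinMiddleware base class; the IRequestLogger abstraction and the specific properties captured are my inventions, and the real handler recorded rather more:

```csharp
using System.Threading.Tasks;
using Microsoft.Owin;

public interface IRequestLogger
{
    Task LogRequestAsync(string method, string path, string caller, int statusCode);
}

// Request-logging middleware that snapshots the request details it needs before
// the response starts going back to the caller, so post-processing never touches
// the original request object again.
public class RequestLoggingMiddleware : OwinMiddleware
{
    private readonly IRequestLogger _logger;

    public RequestLoggingMiddleware(OwinMiddleware next, IRequestLogger logger)
        : base(next)
    {
        _logger = logger;
    }

    public override async Task Invoke(IOwinContext context)
    {
        // Capture plain copies of everything the post-processing will need.
        var method = context.Request.Method;
        var path = context.Request.Path.Value;
        var caller = context.Request.RemoteIpAddress;

        await Next.Invoke(context);

        // Post-processing uses only our own copies (plus the status code), so it
        // no longer matters what the test, or any caller, has done with the
        // request and response objects by now.
        await _logger.LogRequestAsync(method, path, caller, context.Response.StatusCode);
    }
}
```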

The broader benefits here weren’t limited to this failure mode: having introduced the necessary infrastructure, I was able to share the capture code across all our custom middleware components, so request metadata was only fetched once – fractionally reducing the memory footprint. And finally, I made that data capture happen on a background thread in parallel with the main request (rather than briefly blocking the pre-/post-processing), adding another fractional performance improvement.

Neither of these were gains noticeable in isolation (or in any single user request), but they were one more micro-improvement in the general system maintenance that reduced overall resource consumption on a high-request-volume system.

Overall, fixing what seemed to be a rare failure mode in one handler had multiple small impacts across multiple handlers. The fact that these were hard to scope and quantify doesn’t mean their combined impact wasn’t material.

Capturing and reporting full API responses

This scenario involved testing an API. One of our preliminary checks for each test was that the response code was 200-OK. But in the event of an unexpected 500-InternalServerError response, we were losing the response body – which of course had the error details and stack trace in it[2] – and only got the “code was 500, not 200” information.

Any time this happened the dev would have to breakpoint into the test code and capture the response manually to find out what had broken. Not that much effort, but still more time to spend debugging than necessary.

The fix was correspondingly simple – I added a small utility method that took a response and an expected response code, checked the actual code, and included the entire response in the error message if the code was wrong. Ten minutes to write; another 35 to apply it to all existing tests and notify the other devs for future use.
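
The helper was roughly the following. This is a reconstruction from memory rather than the exact code, and the method name is my invention:

```csharp
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using FluentAssertions;

public static class ResponseAssertions
{
    // Checks the status code and, if it's wrong, pulls the entire response body
    // into the failure message, so the server's error details and stack trace
    // land front-and-centre in the test output.
    public static async Task ShouldHaveStatusCode(
        this HttpResponseMessage response, HttpStatusCode expected)
    {
        if (response.StatusCode == expected)
        {
            return;
        }

        var body = await response.Content.ReadAsStringAsync();
        response.StatusCode.Should().Be(expected, "the full response body was: {0}", body);
    }
}
```

In the tests themselves, the preliminary check then becomes a one-liner: await response.ShouldHaveStatusCode(HttpStatusCode.OK);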

This change was the first in a culture-shift of “we improve the tests to give the best information upfront”, after getting our client invested in the potential for improving development velocity (and saving future costs) by fixing some invisible tech debt.


1 Mostly! If you’ve got a transient error that only happens 2% of the time, or between 00:00 and 01:00, or only on a leap day, then you might be in trouble!

2 Because we were running a non-release build locally! Never return system-level details in a live scenario, for security reasons!
