Test-Driven Development and Embracing Failure

At the last London XpDay, some teams talked about their “post-XP” approach. In particular, they don’t do much Test-Driven Development because they find it’s not worth the effort. I visited one of them, Forward, and saw how they’d partitioned their system into composable actors, each of which was small enough to fit into a couple of screens of Ruby. They release new code to a single server in their farm, watching the traffic statistics that result. If it’s successful, they carefully propagate it out to the rest of the farm. If not, they pull it and try something else. In their world, the improvement in traffic statistics, the end benefit of the feature, is what they look for, not the implemented functionality.
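The deploy-to-one-server-and-watch loop described above can be sketched as a small decision procedure. This is a hypothetical illustration, not Forward’s actual tooling; the function names, the single-canary policy, and the threshold comparison are all assumptions made for the example.

```python
def rollout(farm, deploy, metric, threshold=1.0):
    """Sketch of a canary-style rollout: deploy to one server, compare its
    observed metric against the rest of the farm, then either propagate
    or roll back. `deploy` and `metric` are caller-supplied callables."""
    canary, rest = farm[0], farm[1:]
    deploy(canary)
    baseline = sum(metric(s) for s in rest) / len(rest)
    if metric(canary) >= baseline * threshold:
        for server in rest:          # carefully propagate to the rest of the farm
            deploy(server)
        return "propagated"
    return "rolled back"             # pull it and try something else

# Toy demonstration: the metric is a per-server conversion rate,
# and the canary (web1) happens to do better than the baseline.
deployed = set()
rates = {"web1": 0.12, "web2": 0.08, "web3": 0.09}
result = rollout(["web1", "web2", "web3"], deployed.add, rates.get)
# result == "propagated"; all three servers now carry the new code
```

The interesting design point is that the success criterion is the observed metric itself (here a conversion rate, for Forward the traffic statistics), not any assertion about the implemented functionality.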
I think this fits into Dave Snowden’s Cynefin framework, where he distinguishes between the ordered and unordered domains. In the ordered domain, causes lead to effects. This might be difficult to see and require an expert to interpret, but essentially we expect to see the same results when we repeat an action. In the complex, unordered domain, there is no such promise. For example, we know that flocking birds are driven by three simple rules but we can’t predict exactly where a flock will go next. Groups of people are even more complex, as conscious individuals can change the structure of a system whilst being part of it. We need different techniques for working with ordered and unordered systems, as anyone who’s tried to impose order on a gang of unruly programmers will know.
Loosely, we use rules and expert knowledge for ordered systems; the appropriate actions can be decided from outside the system. Much of the software we’re commissioned to build is about lowering the cost of expertise by encoding human decision-making. This works for, say, ticket processing, but is problematic for complex domains where the result of an action is literally unknowable. There, the best we can do to influence a system is to probe it and be prepared to respond quickly to whatever happens. Joseph Pelrine uses the example of a house party—a good host knows when to introduce people, when to top up the drinks, and when to rescue someone from that awful bore from IT. A party where everyone is instructed to re-enact all the moves from last time is unlikely to be equally successful1. Online start-ups are another example of operating in a complex environment: the Internet. Nobody really knows what all those people will do, so the best option is to act, to ship something, and then respond as the behaviour becomes clearer.
Snowden distinguishes between “fail-safe” and “safe-fail” initiatives. We use fail-safe techniques for ordered systems because we know what’s supposed to happen and it’s more effective to get things right—we want a build system that just works. We use safe-fail techniques for unordered systems because the best we can do is to try different actions, none of which is large enough to damage the system, until we find something that takes us in the right direction—with a room full of excitable children we might try playing a video to see if it calms them down.
At the technical level, Test-Driven Development is largely fail-safe. It allows us, amongst other benefits, to develop code that just works (for multiple meanings of “work”). We take a little extra time around the writing of the code, which more than pays back within the larger development cycle. At higher levels, TDD can support safe-fail development because it lowers the cost of changing our mind later. This allows us to take an interim decision now about which small feature to implement next or which design to choose. We can afford to revisit it later when we’ve seen the result without crashing the whole project.
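The fail-safe flavour of TDD reduces to a tiny loop: write a failing test that pins down the intended behaviour, then write just enough code to make it pass. A minimal sketch, assuming nothing beyond the standard library—the payroll-flavoured names are illustrative, not taken from any real system:

```python
import unittest

def net_pay(gross, tax_rate):
    """Deduct tax from a gross amount; written only to satisfy the tests below."""
    if not 0 <= tax_rate < 1:
        raise ValueError("tax_rate must be in [0, 1)")
    return round(gross * (1 - tax_rate), 2)

class NetPayTest(unittest.TestCase):
    # The tests come first and state the expected effect precisely;
    # the implementation above exists only to make them pass.
    def test_deducts_tax_from_gross(self):
        self.assertEqual(net_pay(1000.00, 0.2), 800.00)

    def test_rejects_impossible_rate(self):
        with self.assertRaises(ValueError):
            net_pay(1000.00, 1.5)
```

This is what “code that just works” means here: the behaviour is pinned down before the code exists, which is exactly the right posture for a payroll-like ordered domain where trying some numbers and seeing who complains is not an option.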
Continuous deployment environments such as at Forward2, on the other hand, emphasize “safe-fail”. The system is partitioned so that no individual change can damage it, and the feedback loop is tight enough that the team can detect and respond to changes very quickly. That said, even the niftiest lean start-up will have fail-safe elements too: a sustained network failure or a data breach could be the end of the company. Start-ups that fail to understand this end up teetering on the edge of disaster.
We’ve learned a lot over the last ten years about how to tune our development practices. Test-Driven Development is no more “over” than Object-Orientation is; it’s just that we understand better how to apply it. I think our early understanding was coloured by the fact that the original eXtreme Programming project, C3, was payroll, an ordered system; I don’t want my pay cheque worked out by trying some numbers and seeing who complains3. We learned to Embrace Change, that it’s a sign of a healthy development environment rather than a problem. As we’ve expanded into less predictable domains, we’re also learning to Embrace Failure.

1) this is a pretty good description of many “Best Practice” initiatives
2) Fred George has been documenting safe-fail in the organisation of his development group too; he calls it “Programmer Anarchy”
3) although I’ve seen shops that come close to this

10 replies on “Test-Driven Development and Embracing Failure”

  1. Hm. Your argument seems to depend very heavily on Forward as a case study, and it’s not clear what their safe-fail code is doing. I can’t believe it’s their entire system; a comparison site (f’rinstance) needs to be at least somewhat accurate, and accuracy can’t be inferred from traffic statistics. Is it more like A/B testing of conversion rates for alternative presentations? Or like performance/throughput optimization, which is notoriously hard to predict accurately?
    I’d have thought there were at least 3 criteria to be satisfied before adopting this sort of approach:
    1) Technical; the system needs to have fairly loose coupling.
    2) Epistemic: the success or failure of the system needs to be reliably inferrable via monitoring.
    3) Legal/ethical/financial: failure needs to not matter too much (as your payroll example points out).
    1 is becoming the norm these days with so much of the world moving toward dynamic languages, but I’d have thought 2 and 3 will clamp down heavily on applicability. I suppose it’s a per-aspect question, rather than per-system or per-team. You wouldn’t roll out the transaction logic of a commerce site this way, but you might experiment with usability tweaks to the same site.
    – Mike
    P.S. I’ve heard indirectly of a *very* prominent company rolling out updates like this – a few servers at a time followed by a wait to see if anyone screams – on a tech base and in a business domain where you really wouldn’t expect it. So maybe I’m wrong about the limitations, although I suspect it’s more that some legacy systems are so opaque and brittle that there’s no practical alternative.
    P.P.S. how are people managing the traceability side of this, i.e. keeping track of and being able to reliably reproduce the contents of a given server? Is a DVCS like Git enough?

  2. @Mike I used Forward because it was a visit there that inspired the thought, but it seems to be a common approach for certain applications. And, yes, it’s all about picking the right response to the domain. We seem to have learned more about how to do that recently. I don’t know how they do traceability, but I expect it’s an ordered approach.

  3. @bruce Actually, I’m not sure which side of the fence the chaos monkey belongs to. It looks like it’s there to make sure that known things don’t fail, but by introducing a level of disorder 🙂

  4. Thought-provoking as usual Steve!
    There are certainly some environments or components where embracing failure wouldn’t work. For example, some errors are both dangerous and not immediately obvious, such as those in an accounting or compliance system. You don’t want to find out in an audit that for the past quarter you’ve been off by a penny in half your transaction logs – the sort of thing your production monitoring is likely to miss but which could mean a business catastrophe.
    For what it’s worth, we find at youDevise that one of our products has this characteristic, though less severely – we often locate errors 2-3 days after they begin affecting customers, simply because most users are not attentive enough to pick them up (this application is used for only a few minutes a day by very busy people). However once someone does detect the error, we have to notify everyone and that’s a knock to our reputation. So I don’t think we’ll be adopting Forward’s safe-fail philosophy on this application anytime soon, though we might consider it on less-critical components.

  5. @Squirrel As always, it’s about finding the appropriate response. If you need predictability, then you’re likely to be in an ordered domain (which youDevise mostly is) and safe-fail won’t apply. But if, for example, you were to add a social networking service, then that would change the game.

  6. Once I realized you’d gotten the meaning of “fail-safe” backwards, the rest was irrelevant.

Comments are closed.