Daniel Brahneborg

May 15, 2023

On Reliable Software

In 2020 Mia and I bought tickets to go to SDD Conf, a conference on various software topics. An annoying pandemic put a stop to that, but this year we finally made it. On the first day we attended a workshop on reliability, by Jules May. Here are some notes from this workshop, as I understood his points.

First, some key ideas.

  • Use a singleton "null" object for each class, which happily supports the same interface as all other objects of that class. Obviously this prevents any null pointer issues, but it goes deeper, as it lets all operations stay within the group (hello group theory from mathematics): an operation on objects of the class always has an object of the class to return. The same goes for collections. The first element of a null array is simply the null object for its element type, again preventing null pointer issues. For operations that return basic types, such as length() and whatnot, escaping to the real world may be ok, returning an actual 0 or whatever is appropriate.
  • Following the "broken window" theory, the number of known bugs should be 0. This prevents the system from degrading. With tons of "unimportant" bugs around, noticing the potentially fatal ones could be impossible.
  • Be prepared to throw away the first implementation of everything. This version should still be written, as it provides lots of new understanding of what the system should actually do. It may be useful to write it in a simpler language such as Python, both for productivity and to prevent anybody from thinking it is actually production-level code. For a long time I even had to throw away the second implementation as well, but maybe I'm an unusually bad programmer.
  • Make the system failsafe: every problematic situation or bad input should force the system into a safe state. Close files and sockets, return "null" objects, or whatever it takes.
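
The null object idea is easier to see in code. The following is a minimal sketch in Python, not something shown at the workshop. The Customer class, its methods, NULL_CUSTOMER and first() are all hypothetical names made up for this illustration.

```python
class Customer:
    """A hypothetical domain class, used only for illustration."""

    def __init__(self, name: str, referrer: "Customer | None" = None) -> None:
        self.name = name
        self._referrer = referrer

    def is_null(self) -> bool:
        return False

    def referrer(self) -> "Customer":
        # Stays within the group: always a Customer, never None.
        return NULL_CUSTOMER if self._referrer is None else self._referrer

    def greeting(self) -> str:
        return f"Hello, {self.name}!"

    def name_length(self) -> int:
        # Basic types may escape to the real world: a plain int.
        return len(self.name)


class _NullCustomer(Customer):
    """The single "null" customer; it supports the full Customer interface."""

    def __init__(self) -> None:
        super().__init__(name="")

    def is_null(self) -> bool:
        return True

    def greeting(self) -> str:
        return ""


# The singleton. Callers compare against it or call is_null().
NULL_CUSTOMER = _NullCustomer()


def first(customers: list) -> Customer:
    # The first element of an empty ("null") list is the null customer.
    return customers[0] if customers else NULL_CUSTOMER
```

With this in place, a chain such as first([]).referrer().greeting() never needs a None check. Every step returns a Customer, and the null customer simply answers with an empty greeting.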

Some things which may be easy to change, as starting points.

  • Start each function by testing for all known problematic conditions, returning in whatever way is appropriate. This way the rest of the function can trust parameters not to be null, indices not to be negative, and so on.
  • An extended variant of this is to protect the perimeter, which means you always check any data received from external systems for validity, convert it to proper types, or whatever is appropriate. So, when the rest of the system gets a value which should be a date, it is either a proper date or the null date. In both cases it can perform any date operations it wants on it, and never run into problems.
  • It should now be possible to replace most if statements in the rest of the function with enums, isolating each test to a single place in the code. When the enum is used in a switch statement, all decent compilers give a helpful error message at compile time when a newly added value is left unhandled.
  • It may even be possible to force the remaining if statements to return from the function if their condition is true.
  • What remains now is the happy path, which is what the function actually wants to do. What this is, and that it is correct, is often trivial to show.
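
As a rough sketch of how these steps fit together, here is a small Python example. All names in it (NULL_DATE, parse_date, Age, classify, unreachable, describe) are hypothetical, and using date.min as the null date is my own assumption. Python has no compiler to complain about an unhandled enum value, so the sketch uses the NoReturn idiom, which lets a static type checker such as mypy play that role, with a runtime error as the fallback.

```python
from datetime import date
from enum import Enum
from typing import NoReturn

# Assumption for this sketch: date.min plays the role of the "null date".
NULL_DATE = date.min


def parse_date(raw: object) -> date:
    """Perimeter check: external input becomes a proper date or NULL_DATE."""
    if not isinstance(raw, str):
        return NULL_DATE
    try:
        return date.fromisoformat(raw.strip())
    except ValueError:
        return NULL_DATE


class Age(Enum):
    UNKNOWN = "unknown"
    FUTURE = "future"
    RECENT = "recent"
    OLD = "old"


def classify(d: date, today: date) -> Age:
    """The single place where the date comparisons live."""
    if d == NULL_DATE:
        return Age.UNKNOWN
    if d > today:
        return Age.FUTURE
    if (today - d).days <= 30:
        return Age.RECENT
    return Age.OLD


def unreachable(value: NoReturn) -> NoReturn:
    # A type checker flags calls to this if an Age member is left unhandled.
    raise AssertionError(f"unhandled case: {value!r}")


def describe(raw: object, today: date) -> str:
    age = classify(parse_date(raw), today)
    if age is Age.UNKNOWN:
        return "no valid date"
    if age is Age.FUTURE:
        return "not yet"
    if age is Age.RECENT:
        return "recent"
    if age is Age.OLD:
        return "old"
    unreachable(age)
```

The try/except lives only at the perimeter, inside parse_date. Everything after it works on real date objects, and each branch in describe returns immediately, so adding a new Age member without handling it shows up as a type error rather than as a silent fall-through.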

Some more advanced and language-specific things:

  • Avoid exceptions. Their main problem is that execution resumes after the problematic code, making it difficult to know how to recover. The old C functions setjmp and longjmp do the opposite, returning to a point before the problematic code. The caller can then decide to do something else. Erlang takes a similar approach, letting processes kill themselves when they run into problems, after which they are restarted from the beginning so they can try again.
  • Make interfaces impossible to use incorrectly. An easy example is functions such as "withOpenFile(name)", which makes it impossible to forget to close the file afterwards.
  • Make complex classes "fly by wire". The callers feed it events and requests, and get callbacks at appropriate times, perhaps most often after the requested action has been performed. Again, the interface becomes impossible to use incorrectly. This also results in a separation between the complex general code, and the simpler caller specific code.
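
The "withOpenFile" idea could look something like this in Python. This is only a sketch: the second parameter, action, is my own addition, as the function needs to be told what to do with the file, and the name with_open_file is just the Python spelling of the example above.

```python
from typing import Callable, TextIO, TypeVar

T = TypeVar("T")


def with_open_file(name: str, action: Callable[[TextIO], T]) -> T:
    """Open the file, hand it to the caller's action, and always close it."""
    # The with statement closes the file whether action returns or raises.
    with open(name, encoding="utf-8") as handle:
        return action(handle)


def count_lines(handle: TextIO) -> int:
    # Caller specific code: it only ever sees an open file.
    return sum(1 for _ in handle)
```

A call such as with_open_file("notes.txt", count_lines), with a made-up file name, returns the line count. The caller never holds the open file outside of the action, so there is nothing to forget to close.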

Some of this I already do, and some of it I will try, as the ideas are natural extensions of my existing code.