The World and the Machine: Avoiding Failure

Programmers need to address failure concerns—to avoid the design errors that enable common failures. Failure examples include numerical overflow, floating-point underflow, null pointer dereference, memory leakage, and using uninitialised variable values: all these are well-known. Compilers for some languages—such as SPARK Ada—diagnose many of these design errors by static analysis.
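As a small illustration of addressing two of these concerns explicitly, the sketch below checks for overflow in fixed-width signed addition and for silent floating-point underflow. It is only a sketch: Python's own integers do not overflow, so the 32-bit range is simulated, and the names `checked_add_i32` and `underflows_to_zero` are hypothetical, not taken from any library.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def checked_add_i32(a: int, b: int) -> int:
    """Add two values as if they were 32-bit signed integers.

    Raises OverflowError instead of wrapping around silently.
    """
    for v in (a, b):
        if not INT32_MIN <= v <= INT32_MAX:
            raise ValueError(f"{v} is not a 32-bit signed integer")
    result = a + b  # exact, because Python integers have arbitrary precision
    if not INT32_MIN <= result <= INT32_MAX:
        raise OverflowError(f"{a} + {b} overflows the 32-bit signed range")
    return result


def underflows_to_zero(x: float, y: float) -> bool:
    """Report whether x * y is exactly 0.0 although neither factor is zero."""
    return x != 0.0 and y != 0.0 and x * y == 0.0
```

The point of the design is that each concern is turned into an explicit check whose failure is reported, rather than left as a silent wrong value.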

In software engineering for cyber-physical systems the same need is critical. In each development task a software engineer should be explicitly aware of a checklist [1]: a catalogue of failure concerns against which the work must be checked. The catalogue as a whole is open-ended. Responsible engineering disciplines are characterised by steady growth in the quality and scope of the concern catalogue: when a major failure occurs in system operation, rigorous investigation and analysis refine or enlarge existing concerns or add new ones. In response to fatal crashes of the de Havilland Comet passenger aircraft in 1953-1954, all the recoverable debris from one crash was raised from the sea bed and reassembled. It was eventually discovered that metal fatigue cracks had developed and grown at the corners of the square windows of the fuselage. Metal fatigue was then explicitly recognised as a major failure concern in aircraft flying at high speeds at high altitudes: since then, rounded passenger windows in aircraft have been universal.

In cyber-physical systems, sources of failure are unbounded. Failure is never impossible, and developments in applications and technologies continually add new possibilities. If a system is more than a toy, and the consequences of failure cannot be neglected, an established catalogue of failure concerns, and the available means of addressing them, is an essential tool. System failures usually have many contributory causes, and failure concerns overlap and interact. But we may still focus our attention on catalogues of basic failure concerns for three broad areas of development. The first is triplet concerns, in the initial isolated development of individual triplets, including incremental complication for fault tolerance. The second is combination concerns, in combining constituent behaviours and eventually assembling the complete system. The third is model concerns, addressing the ever-present difficulty of developing formal models adequately faithful to non-formal physical realities.
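One way to picture such a catalogue is as a small, open-ended data structure grouped by the three areas described above. The following is a minimal sketch under stated assumptions: the class names `Area`, `Concern`, and `Catalogue`, and the example concern names, are hypothetical illustrations, not an established catalogue.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class Area(Enum):
    TRIPLET = "triplet"
    COMBINATION = "combination"
    MODEL = "model"


@dataclass(frozen=True)
class Concern:
    area: Area
    name: str


@dataclass
class Catalogue:
    concerns: List[Concern] = field(default_factory=list)

    def add(self, concern: Concern) -> None:
        """Enlarge the catalogue, for example after investigating a failure."""
        if concern not in self.concerns:
            self.concerns.append(concern)

    def unaddressed(self, addressed: Set[str]) -> List[Concern]:
        """Return the concerns the work has not yet been checked against."""
        return [c for c in self.concerns if c.name not in addressed]


# Hypothetical entries, one per area, purely for illustration.
catalogue = Catalogue()
catalogue.add(Concern(Area.TRIPLET, "sensor fault tolerance"))
catalogue.add(Concern(Area.COMBINATION, "interference between behaviours"))
catalogue.add(Concern(Area.MODEL, "model fidelity to physical reality"))

remaining = catalogue.unaddressed({"sensor fault tolerance"})
```

Here `remaining` holds the two concerns that a review has not yet covered; a real catalogue would of course carry far more detail about each concern and the means of addressing it.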

The development, use, and continuing improvement of explicit catalogues of concerns may seem unnecessarily bureaucratic. It is not. In developing a system and reviewing its products, many approaches focus, understandably, on what we might call the positive thrust of development. The developers ask themselves “What are the requirements?”, and the reviewers ask themselves whether the resulting products satisfy the requirements. But this emphasis on the positive has a strong self-limiting tendency: it works to an agenda whose items are all achievements of product success. A complementary agenda of finding product failures is no less important, and perhaps far more. Looking for potential failures is hard. And it’s even harder if you don’t know what you are looking for.

[1] Atul Gawande; The Checklist Manifesto: How to Get Things Right; Henry Holt, 2009.

Sunday, 8 December 2019
