Wednesday, 4 December 2019


At the enquiry into the 1986 Challenger space shuttle disaster, Richard Feynman famously demonstrated the effect of low temperature on the O-ring seals of the joints in the shuttle's Solid Rocket Booster. His more general observations on the shuttle's reliability are less well known, but deserve attention. In a personal appendix to the enquiry report [1] he justified engineers' usual adoption of "the component system, or bottom-up design". Engineers working this way start at the bottom by thoroughly understanding the materials to be used, and work upwards level by level to the final design of the entire engine. He went on to criticise in detail the way in which the Space Shuttle Main Engine was designed "in a different manner, top down, we might say".

Feynman's observations apply no less to the design of complex system behaviour. From carefully evaluated causal links at the bottom, to triplets and their failure concerns, to incremental complications and combinations, design progresses upwards to the whole behaviour, building always on a firm foundation of analysis and understanding of the preceding levels. The theme—common to the complexities of the shuttle and of cyber-physical system behaviour—can be summarised in two admonitions: "If you haven't understood the components yet you shouldn't begin to design their combination;" and "When a difficulty emerges in a lower level component, it will be far harder and more expensive to resolve, and perhaps even impossible, if the component is already embedded in the upper levels of a top down design."

Why, then, has top down design been a persistently attractive theme in software development? Because it can be effective in a formal setting, where large abstractions can be defined and instantiated with perfect confidence. "Let C be a circle with centre at O and radius r" precisely and completely defines an instance of a circle. We need not fear low-level anomalies: no arc of C will fail to be exactly circular; and no diameter will fail to subtend a right angle at the circumference. In such a formal setting, top down design and its cousin stepwise refinement can reason confidently.

Unfortunately, the physical world is not a formal setting—at least, not at the granularities and scales significant to cyber-physical systems. If we start by naming—and trying to define—a large abstraction to capture the whole system behaviour, and proceed to refine it step by step to smaller abstractions, to increasingly concrete combinations of components still to be studied and modelled, we must expect to encounter some unwelcome surprises. This difficulty cannot be overcome by intellectual determination, let alone by fiat. We cannot simply insist that at each level our abstractions will be disjoint and semantically precise.

Here's a tiny example: a top-level decomposition of automotive function into driving, cabin experience, driver information, and safety. The stop-start feature is in driving; the air-conditioning feature is in cabin experience. At a lower level the top-level decomposition is vitiated by a surprise interaction: air-conditioning's demands on the battery may prevent reliable engine starting, so the two features are not independent after all. As always, the devil lurks in concrete levels yet to be explored, and cannot be exorcised by overconfident abstractions in top down decomposition.
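
The interaction can be made concrete with a small sketch. It is only an illustration: the class names, the amp-hour figures, and the restart threshold are all hypothetical, chosen to show how two features that sit in different top-level categories can still meet at a shared lower-level component, here the battery.

```python
from dataclasses import dataclass


@dataclass
class Battery:
    """Shared lower-level component that neither top-level category owns."""
    charge_ah: float  # remaining charge in amp-hours (illustrative figure)

    def draw(self, amount_ah: float) -> None:
        # Charge cannot fall below zero.
        self.charge_ah = max(0.0, self.charge_ah - amount_ah)


class AirConditioning:
    """Placed under 'cabin experience' in the top-level decomposition."""
    DRAW_PER_MINUTE_AH = 0.5  # hypothetical consumption with the engine stopped

    def __init__(self, battery: Battery) -> None:
        self.battery = battery

    def run_while_engine_stopped(self, minutes: int) -> None:
        self.battery.draw(self.DRAW_PER_MINUTE_AH * minutes)


class StopStart:
    """Placed under 'driving' in the top-level decomposition."""
    MIN_CHARGE_TO_RESTART_AH = 20.0  # hypothetical restart threshold

    def __init__(self, battery: Battery) -> None:
        self.battery = battery

    def can_restart_engine(self) -> bool:
        return self.battery.charge_ah >= self.MIN_CHARGE_TO_RESTART_AH


battery = Battery(charge_ah=22.0)
stop_start = StopStart(battery)
air_con = AirConditioning(battery)

restart_ok_before = stop_start.can_restart_engine()
air_con.run_while_engine_stopped(minutes=6)  # engine stopped at a long halt
restart_ok_after = stop_start.can_restart_engine()

print(restart_ok_before, restart_ok_after)
```

Nothing in either feature class mentions the other; the conflict appears only once both are bound to the same battery, which is exactly the level the top-level decomposition had not yet examined.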

[1] Personal observations on the reliability of the Shuttle; Appendix F of the Rogers Report, 1986.
