This guide targets failure modes that arise from various hazards and contribute to many incidents. It aims to prevent many of these failure modes by providing guidelines for defending against the hazards.

The need to consider operational errors

Studies of the sources of critical accidents in the operation of human-made systems indicate that most of them are commonly attributed to errors made by human operators.

The developer's responsibility

Most studies of resilience engineering focus on the analysis of incidents, stressing management's role and responsibility in preventing mishaps through safety culture (e.g., Hollnagel et al., 2006). In contrast, this guide focuses on aspects of system design; specifically, on preventing design mistakes.

The typical attitude toward incidents is emotion driven: instead of considering how to prevent the next incident, people look for specific individuals to blame. Following an incident or accident, stakeholders typically focus on accountability, looking for "bad apples" (Dekker, 2007) rather than on improving safety.

Engineering role

This guide assumes that it is the responsibility of the system engineers to prevent mishaps, and it is about achieving this goal. It is the developer's responsibility to reduce the likelihood and mitigate the costs of all hazards, and to prevent predictable incidents, including those typically regarded as operator errors. Risky situations should be anticipated, and the design should include means to mitigate these risks.

Modeling the system resilience

Analysis of operational failures is possible when it is based on models of incident generation ... . These models describe typical ways in which hazards are generated and develop. The concepts underlying the methodology presented here are illustrated by the extended Swiss cheese model ... . The key models used here are described in the following sections.
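The Swiss cheese model is often given a simple quantitative reading: each defense layer occasionally fails (a "hole"), and a hazard leads to an incident only when it passes through holes in every layer. A minimal sketch of this reading, assuming independent layer failures and illustrative probabilities not taken from the guide:

```python
# Illustrative sketch of the Swiss cheese model's quantitative reading.
# Assumption: each defense layer fails independently with a known
# probability; a hazard becomes an incident only if every layer fails.

def incident_probability(layer_failure_probs):
    """Probability that a hazard penetrates all defense layers."""
    p = 1.0
    for q in layer_failure_probs:
        p *= q
    return p

# Three independent layers, each failing 10% of the time:
# roughly one incident per thousand hazards (~0.001).
risk = incident_probability([0.1, 0.1, 0.1])
```

The point of the sketch is that adding even a moderately reliable layer multiplies down the residual risk, which is why the guide organizes defenses in layers.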

Slips and confusion: not errors

More than 2,000 years ago, the Roman philosopher Cicero (Wiki ...) already observed that "to err is human". The 18th-century English poet Alexander Pope (Wiki ...) added "to forgive, divine", suggesting that the term "error" carries an accountability bias. The term is used extensively in investigations to justify diverting the discussion from costly investment in resilience assurance to cheap and handy personnel changes.

Statistics on the sources of accidents indicate that most of them are commonly attributed to the human factor (statistics). This guide focuses on preventing the errors typically attributed to human operators, and specifically on their sources: psychomotor limitations result in slips, and information-related problems result in confusion.

Focus on exceptional situations

Incident analysis indicates that most incidents involve operators' difficulties in handling exceptional situations; operator errors are typically due to these difficulties.

Defenses

The models of system resilience enable the definition and design of defense layers, including methods for hazard prevention and error protection, both proactive and reactive.

Confusion prevention

The resilience models include representations of user and operator behavior, such as the ways they perceive and understand the operational procedures and the system's behavior. Resilience-oriented design enables the prevention of predictable interaction flaws, typically attributed to operator errors.

Defense evaluation

Any defense added to the system introduces new hazards, called threats. The challenge is to defend the system against these new hazards and to evaluate the costs of the various defense options, together with the threats they introduce.
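One way to frame this evaluation is as an expected-cost comparison: each option is charged for its own cost, for the residual risk of the original hazard, and for the new threats it introduces. A minimal sketch with hypothetical option names and assumed numbers (none of which come from the guide):

```python
# Illustrative defense evaluation: expected cost of each option, counting
# the defense's own cost, the residual hazard risk, and the new threats
# the defense introduces. All names and figures are assumed examples.

def expected_cost(option):
    residual = option["residual_prob"] * option["hazard_cost"]
    threats = sum(p * c for p, c in option["new_threats"])
    return option["defense_cost"] + residual + threats

options = {
    "no defense": {"defense_cost": 0, "residual_prob": 0.02,
                   "hazard_cost": 10000, "new_threats": []},
    "interlock":  {"defense_cost": 50, "residual_prob": 0.002,
                   "hazard_cost": 10000,
                   # the interlock itself may fail and block a valid action
                   "new_threats": [(0.01, 300)]},
}

best = min(options, key=lambda name: expected_cost(options[name]))
```

Even with its own cost and the threat it adds, the interlock wins here (expected cost 73 versus 200), illustrating why threats introduced by a defense must be priced in rather than used as a reason to reject the defense outright.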

Rule-based design

Rules integrated into a knowledge base describing system behavior in normal and exceptional conditions enable automated detection of hazards and unexpected events.
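A minimal sketch of such rule-based detection, with hypothetical rule names and state fields (the guide does not prescribe a concrete rule format): each rule pairs a name with a predicate over the current system state, and a rule that fires marks a hazard or an unexpected event.

```python
# Sketch of rule-based hazard detection. The rules and state fields are
# assumed examples; the idea is only that rules in a knowledge base can
# be evaluated mechanically against the current operational state.

RULES = [
    ("overpressure",        lambda s: s["pressure"] > s["max_pressure"]),
    ("valve inconsistency", lambda s: s["valve_open"] and s["mode"] == "shutdown"),
]

def detect_hazards(state, rules=RULES):
    """Return the names of all rules violated by the given state."""
    return [name for name, violated in rules if violated(state)]

state = {"pressure": 210, "max_pressure": 200,
         "valve_open": True, "mode": "shutdown"}
print(detect_hazards(state))  # ['overpressure', 'valve inconsistency']
```

Because the rules are data rather than hard-wired code, they can cover exceptional conditions explicitly and be refined as new incident reports arrive.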

Implementation

A dedicated architecture must be designed to support these features: consistency assurance, hazard detection, and alarming.
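One possible shape for such an architecture, sketched here under assumptions of our own (the guide does not specify one): a monitor component runs consistency checks over each state update and routes any violation to an alarm sink.

```python
# Assumed architectural sketch: a monitor wiring checks to an alarm sink.
# The check, state fields, and alarm format are illustrative only.

class Monitor:
    def __init__(self, checks, alarm_sink):
        self.checks = checks          # list of (name, predicate) pairs
        self.alarm_sink = alarm_sink  # callable receiving alarm messages

    def on_state_update(self, state):
        for name, violated in self.checks:
            if violated(state):
                self.alarm_sink(f"ALARM: {name}")

alarms = []
monitor = Monitor(
    checks=[("speed limit exceeded", lambda s: s["speed"] > s["limit"])],
    alarm_sink=alarms.append,
)
monitor.on_state_update({"speed": 120, "limit": 100})
# alarms now holds ["ALARM: speed limit exceeded"]
```

Keeping the alarm sink as a pluggable callable separates detection from notification, so the same checks can drive a console, a log, or an operator display.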

Resilience development

System resilience develops in cycles, starting with proactive assurance and followed by reactive assurance. Proactive assurance is about designing the protection layers; reactive assurance is about learning from incidents. Small cycles allow fine-tuning of the operational rules at the development site (as part of alpha testing), and large cycles are used to learn from real incidents at the customer site.


Updated on 05 Apr 2017.