ISO 26262-Compliant Safety Analysis Methods

Yuzhu Yang 杨玉柱 & Prof. Dr. Mirko Conrad

The development of safety-related Electrical/Electronic (E/E) systems in the automotive industry is closely tied to functional safety, and a critical aspect of developing functionally safe systems is performing safety analyses in accordance with the ISO 26262 standard, which provides recommendations on safety analysis methodology. Which analytical methods are employed in the analysis of automotive safety-related systems? How are these methods classified and applied? This article addresses these questions with a view to supporting the development of ISO 26262-compliant products: it introduces the analytical methods used in the development of safety-related systems in the automotive industry, along with the best practices associated with these methods.

The Purpose of Safety Analyses

Start with the purpose of automotive safety analyses: why is it necessary to conduct safety analyses when developing safety-related automotive E/E systems?

ISO 26262 defines functional safety as: "Absence of unreasonable risk due to hazards caused by malfunctioning behavior of E/E systems" (ISO 26262:2018). Typically, malfunctions in E/E systems are caused by two types of failures:

  • Systematic Failure: A failure that is related in a deterministic way to a certain cause and can only be eliminated by changing the design, the manufacturing process, operating procedures, documentation, or other associated factors.
  • Random Hardware Failure: A failure that can occur unpredictably during the lifetime of a hardware element and that follows a probability distribution.

Therefore, the aim of safety analyses is to ensure that the risk of safety goal violations due to systematic or random failures is sufficiently low.

It is important to note that ISO 26262 does not quantify the occurrence rate of systematic failures. Nevertheless, measures that prevent systematic failures help to reduce the overall risk of violating safety goals or safety requirements.

Scope of Safety Analyses

The scope of safety analyses includes:

  • Validation of safety goals and safety concepts.
  • Verification of safety concepts and safety requirements.
  • Identification of conditions and causes (incl. faults / failures) that could lead to the violation of a safety goal or safety requirement.
  • Identification of additional safety requirements for detection of faults / failures.
  • Determination of the required responses to detected faults / failures.
  • Identification of additional measures to verify that the safety goals or safety requirements are complied with.

Implementation of Safety Analyses

Depending on the application, safety analyses can be used to:

  • Identify new hazards not previously identified during the HARA
  • Identify faults, or failures, that can lead to the violation of a safety goal / safety requirement
  • Identify potential causes of such failures
  • Support the definition of safety measures for fault prevention / fault control
  • Provide evidence for the suitability of safety concepts
  • Support the verification of safety concepts, safety requirements
  • Support the identification of design and test requirements

In other words: for the item under analysis, safety goals are derived from the HARA on the basis of the safety concept, and the safety requirements are in turn derived from the safety goals. Additional safety requirements are then specified for the detection of the identified faults or failures, and the required responses to detected faults or failures are determined. Finally, additional measures are identified to verify whether the implemented safety measures satisfy the corresponding safety requirements and/or safety goals.

Introduction to Safety Analysis Methods

Qualitative and Quantitative Methods

1. Qualitative Safety Analysis Methods

The methods of qualitative safety analyses primarily include:

  • Qualitative FMEA at system, design, or process level
  • Qualitative FTA
  • Hazard and Operability Analysis (HAZOP)
  • Qualitative ETA

Qualitative analysis methods are particularly suitable for software safety analyses, for which no more appropriate (quantitative) methods are available.

2. Quantitative Safety Analysis Methods

Quantitative safety analysis methods complement qualitative safety analysis methods and are primarily used to evaluate the hardware architectural metrics and the risk of safety goal violations due to random hardware failures against defined target values, thereby validating the hardware design (please refer to ISO 26262:2018, 5 and 8). Quantitative safety analyses additionally require knowledge of the quantitative failure rates of the hardware elements.

The methods of quantitative safety analyses primarily include:

  • Quantitative FMEA
  • Quantitative FTA
  • Quantitative ETA
  • Markov models
  • Reliability block diagrams (RBD)

3. Differences and Connections Between Quantitative and Qualitative Analyses

3.1 Differences Between Quantitative and Qualitative Analyses

The distinction between these two methods is as follows:

Quantitative analysis is concerned with predicting failure rates, whereas qualitative analysis focuses on identifying failures without predicting their rates. Qualitative safety analysis methods are generic in nature and can be applied at the system, hardware, and software levels. Quantitative safety analyses additionally require knowledge of the quantitative failure rates of the relevant hardware elements; in the context of ISO 26262, they are employed to evaluate the hardware architectural metrics and the risk of safety goal violations due to random hardware failures, and thereby to validate the hardware design.

3.2 Connections Between Quantitative and Qualitative Analyses

Both rely on an understanding of the relevant fault types and failure modes, and quantitative safety analyses complement qualitative analyses. In engineering applications, the two approaches should be used in conjunction.

Inductive and Deductive Analyses

Apart from the qualitative/quantitative distinction, safety analysis methods can also be classified by how they are carried out: inductive and deductive analyses.

1. Introduction to Inductive and Deductive Analyses

Inductive analyses are also known as bottom-up methods: starting from causes at the lowest level, they work upward to infer the possible effects and thus identify potential failures. In contrast, deductive analyses are top-down methods that start from a failure and trace back to its possible causes.

Common Safety Analysis Methods

Several methods are used in engineering applications. For example, FMEA (Failure Mode and Effects Analysis) and FTA (Fault Tree Analysis) are two standard methods for analyzing faults and failures of items and elements within the ISO 26262 framework. FMEDA is typically applied when the system is being designed to meet the relevant ASIL requirements. ETA or RBD (Reliability Block Diagrams) can also be used to perform safety-related analyses.

Figure 1. FMEA Handbook

Failure Mode and Effects Analysis (FMEA)

Failure Mode and Effects Analysis (FMEA) is one of the earliest failure analysis techniques; it was developed by reliability engineers in the 1940s to study potential malfunctions of military systems. It was adopted by the automotive industry in the 1970s and later incorporated into international standards. The most widely used procedure today is documented in the FMEA handbook published jointly by AIAG (Automotive Industry Action Group) and VDA (German Association of the Automotive Industry) in 2019 (shown in Figure 1), which also assists suppliers in their development work.

The handbook was developed by subject matter experts (SMEs) from OEMs and Tier 1 suppliers. It integrates the best practices of AIAG and VDA into a structured methodology, which includes design FMEA, process FMEA, and supplemental FMEA considerations for prevention and detection measures. FMEA focuses on technical risks and is an analytical method for preventive quality management and monitoring in product design and production processes.

Figure 2. FMEA Diagram, Bottom-up Approach - ISO 26262-10:2018(E)

FMEA begins by analyzing the causes of malfunction of each structural element of the system, studying element malfunctions inductively in order to derive optimization measures for potentially unacceptable failures. As a typical application in the automotive industry, FMEA can be performed qualitatively or quantitatively to analyze faults and malfunctions in safety-related system designs. Usually performed as an inductive (bottom-up) approach (see Figure 2), FMEA focuses on how failures manifest themselves in system elements and how these failures affect the system.

Figure 3. FTA Diagram, Top-down Approach - ISO 26262-10:2018(E)

Failure Modes, Effects and Diagnostic Coverage Analysis (FMEDA)

The Failure Modes, Effects and Diagnostic Coverage Analysis (FMEDA) methodology was first developed by exida in the 1990s, and in 2011 the functional safety standard ISO 26262 adopted it as a recommended method. FMEDA can be regarded as a quantitative extension of FMEA: it considers quantitative failure rates of the hardware elements and the distribution of failure modes over these elements, and it identifies critical failure modes by taking into account the safety mechanisms addressing the corresponding failure modes and their diagnostic coverage.

The FMEDA method is mainly used in the hardware architectural design and hardware detailed design phases, where the hardware architectural metrics, such as the single-point fault metric (SPFM) and the latent fault metric (LFM), are calculated at the hardware design level. Applying FMEDA iteratively can improve the hardware design.
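As a rough illustration of the FMEDA bookkeeping, the following Python sketch computes SPFM and LFM from a small, hypothetical table of failure modes; the names, FIT rates, and diagnostic coverage values are invented for illustration, and the split of each failure rate is deliberately simplified.

```python
# Minimal FMEDA-style bookkeeping sketch (illustrative only).
# Failure mode names, FIT rates, and coverage values are hypothetical.

from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    fit: float          # failure rate in FIT (failures per 10^9 h)
    violates_sg: bool   # failure mode can directly violate the safety goal
    dc_residual: float  # diagnostic coverage w.r.t. residual faults (0..1)
    dc_latent: float    # diagnostic coverage w.r.t. latent faults (0..1)

def spfm_lfm(modes):
    """Compute SPFM and LFM from a simplified split of each failure rate."""
    total = sum(m.fit for m in modes)
    # Single-point/residual share: uncovered part of modes that can violate the safety goal.
    spf_rf = sum(m.fit * (1.0 - m.dc_residual) for m in modes if m.violates_sg)
    # Latent multiple-point share: remaining faults not covered w.r.t. latent faults.
    latent = 0.0
    for m in modes:
        remaining = m.fit - (m.fit * (1.0 - m.dc_residual) if m.violates_sg else 0.0)
        latent += remaining * (1.0 - m.dc_latent)
    spfm = 1.0 - spf_rf / total
    lfm = 1.0 - latent / (total - spf_rf)
    return spfm, lfm

modes = [
    FailureMode("RAM bit flip",       50.0, True,  0.99, 0.90),
    FailureMode("ADC drift",          10.0, True,  0.90, 0.60),
    FailureMode("Watchdog osc. loss",  5.0, False, 0.00, 0.90),
]
spfm, lfm = spfm_lfm(modes)
print(f"SPFM = {spfm:.2%}, LFM = {lfm:.2%}")
```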

Fault Tree Analysis (FTA)

Fault Tree Analysis (FTA), developed in the early 1960s by Bell Labs, was used to evaluate a ballistic missile launch system. This analysis method was then standardized by IEC in 2006 and has been referenced by standards such as ISO 26262 as a potential or recommended analysis method.

FTA can be applied as a qualitative or a quantitative analysis technique. For example, one can start with a qualitative fault analysis and then add quantitative failure statistics to strengthen the analysis and obtain the quantitative variant.

In contrast to FMEA, FTA is a deductive (top-down) analysis method (see Figure 3) that helps identify the base events, or combinations of base events, that may lead to the defined top event. Typically, the top event is an undesired system event that violates a safety goal or a safety requirement derived from a safety goal.

To establish an FTA, we start with the undesired top event and progressively build a graphical tree structure. The interaction of the potential causes of the undesired event is represented by Boolean logic gates, such as AND, OR, and NOT gates. The quantitative variant of FTA can be used to calculate the PMHF metric, which is also a method recommended by ISO 26262.
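As a sketch of the quantitative variant, the small fault tree below combines base event probabilities through AND/OR gates under the usual assumption of independent base events; the event names and probabilities are invented for illustration only.

```python
# Tiny fault tree evaluation sketch (independent base events assumed).
# Event names and probabilities are illustrative, not from a real analysis.

def or_gate(*probs):
    # P(A or B) = 1 - product(1 - p_i) for independent events
    out = 1.0
    for p in probs:
        out *= (1.0 - p)
    return 1.0 - out

def and_gate(*probs):
    # P(A and B) = product(p_i) for independent events
    out = 1.0
    for p in probs:
        out *= p
    return out

# Hypothetical base event probabilities (per operating hour)
sensor_fault  = 1e-6
mcu_fault     = 5e-7
monitor_fault = 1e-7

# Top event: unintended actuation = (sensor OR MCU fault) AND monitoring fails
top_event = and_gate(or_gate(sensor_fault, mcu_fault), monitor_fault)
print(f"P(top event) ~ {top_event:.2e} per hour")
```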

Figure 4. Classification and Integration of Analytical Methods

Comprehensive Application of Safety Analysis Methods

In practical engineering applications, the two classification schemes can be combined to form the classification shown in Figure 4. In the development of safety-critical E/E systems, combining top-down methods (such as FTA) and bottom-up methods (such as FMEA) makes it possible to identify the failure modes of semiconductor elements and to apply the results at the element level. Starting from a lower level of abstraction, a quantitatively precise assessment of the failure distribution of semiconductor elements can be performed, with the failure distribution otherwise based on qualitative assumptions.

Figure 5. FTA and FMEA Combined Analyses Diagram - ISO 26262-10:2018(E)

E/E systems are composed of numerous elements and sub-components. FTA and FMEA can be combined into a complementary safety analysis that balances top-down and bottom-up methods. Figure 5 shows one possible combination of FTA and FMEA: the base events originate from different FMEAs (marked FMEA A-E in this example) and are derived from analyses conducted at a lower level of abstraction (such as the sub-component, component, or module level). In this case, base events 1 and 2 derive from failures identified in FMEA D, while failures from FMEA B are not used in the FTA.
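The sketch below uses hypothetical FMEA entries and failure rates to map failure modes identified in lower-level FMEAs onto the base events of a system-level FTA, mirroring the example in which base events 1 and 2 come from FMEA D; all names and values are invented.

```python
# Illustrative linkage of FMEA results to FTA base events (hypothetical content).

# Failure modes identified per FMEA, with invented failure rates in FIT.
fmea_results = {
    "FMEA A": {"connector open": 2.0},
    "FMEA B": {"housing crack": 1.0},   # not used in the FTA in this example
    "FMEA C": {"driver stage short": 3.0},
    "FMEA D": {"sensor open circuit": 4.0, "sensor short to ground": 6.0},
    "FMEA E": {"supply undervoltage": 5.0},
}

# FTA base events referencing FMEA failure modes (base events 1 and 2 from FMEA D).
fta_base_events = {
    "base event 1": ("FMEA D", "sensor open circuit"),
    "base event 2": ("FMEA D", "sensor short to ground"),
    "base event 3": ("FMEA C", "driver stage short"),
}

for event, (fmea, mode) in fta_base_events.items():
    fit = fmea_results[fmea][mode]
    print(f"{event}: '{mode}' from {fmea}, lambda = {fit} FIT")
```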

Figure 6. Safety Analyses in the Safety Life Cycle

Safety Analysis Methods within the Safety Life Cycle

Mapping of Safety Analyses to the Safety Life Cycle

ISO 26262 defines a safety life cycle that encompasses the main safety activities in the concept phase, product development, production, operation, service, and decommissioning. As an important part of product development, safety analyses shall be performed at the system, hardware, and software levels. The level of detail of the failure model described in a safety analysis depends on the level of detail of the corresponding development sub-phase and is consistent within that sub-phase (see Figure 6). For example, in the concept phase, safety analyses are performed on the preliminary architecture at the appropriate level of abstraction; in the product development phases, the level of detail of the analysis can depend on the development sub-phase and take the applied safety mechanisms into account.

Figure 7. Safety Analyses in the Hardware Design Phase of the Safety Life Cycle

Typically, safety analyses are associated with design-phase activities. In the concept, system, and hardware development phases, they accompany activities such as architectural design and integration verification of the system and hardware (Figure 7); similarly, in the software development phase, safety analyses accompany software development activities such as software architectural design, unit design, and verification (Figure 8).

Figure 8. Safety Analyses in the Software Design Phase of the Safety Life Cycle

Safety Analyses in Concept Phase

In the concept phase, ISO 26262 recommends qualitative safety analysis, in particular FMEA, FTA, or HAZOP as mentioned in the standard, to support the derivation of a valid set of functional safety requirements. First, a qualitative safety analysis of the system architectural design provides evidence that the design is suitable to provide the specified safety-related functionality and safety-related attributes, for example by analyzing requirements for the independence of, or freedom from interference between, parts or elements of the system, and by determining the causes of failures and the effects of malfunctions. Second, if safety-related elements and interfaces have already been defined, safety analyses help to identify previously unknown safety-related elements and interfaces. Finally, safety analyses support the specification of improved safety designs by validating the effectiveness of the safety mechanisms against the identified failure causes and failure effects.

Considering the potential adverse effects of the safety of the intended functionality (SOTIF) and of cybersecurity on the achievement of functional safety contributes to the overall development of a safe E/E system. Similar considerations apply to subsequent development phases; SOTIF and cybersecurity analyses are beyond the scope of this article.

Figure 9. Failure Classification of Safety-Related Hardware Components for Relevant Items - ISO 26262-5:2018(E)

Safety Analyses in Hardware Design Phase

1. Qualitative Safety Analyses in Hardware Design Phase

In the hardware design phase, various safety analysis techniques are combined. On the one hand, there is the qualitative safety analysis of the hardware design. For example, qualitative FTA helps to identify the causes and effects of hardware failures. For safety-related hardware components or hardware parts, qualitative FMEA helps to distinguish the different fault classes, in particular which faults are safe faults, which are single-point or residual faults, and which are multiple-point faults. In line with the recommendations of the ISO 26262 standard, a combination of deductive and inductive analysis methods should be used for these safety analyses.

2. Quantitative Safety Analyses in Hardware Design Phase

On the other hand, when addressing random hardware failures, quantitative safety analyses of the hardware design are required. Quantitative safety analysis helps to evaluate or calculate the metrics related to the hardware architectural design: the single-point fault metric (SPFM), the latent fault metric (LFM), and the probabilistic metric for random hardware failures (PMHF). Typically, the Failure Modes, Effects and Diagnostic Coverage Analysis (FMEDA) method is used to perform this quantitative analysis and to evaluate the suitability of the hardware architectural design for detecting and controlling safety-related random hardware failures. This is done by analyzing the scenarios in which safety goals are violated due to random hardware failures and calculating the corresponding metric values for the hardware architectural design in question.

2.1. Failure classification of safety-related hardware components

Failures occurring in safety-related hardware components should be categorized as:

a) Single point fault

b) Residual fault

c) Multiple point fault

d) Safe fault

Multiple-point faults are further differentiated into latent, detected, and perceived faults. Safety-related hardware element failures are thus categorized as shown in Figure 9.

Figure 10. Classification of Failure Modes for Hardware Elements

In this classification:

  • Distance n indicates the number of independent faults that must occur in combination to lead to a violation of a safety goal (single-point or residual faults: n = 1, dual-point faults: n = 2, etc.).
  • Faults at distance n are located between the n-ring and the (n-1)-ring.
  • Unless explicitly addressed in the technical safety concept, multiple-point faults with a distance greater than 2 are considered safe faults.

Note that in the case of a transient fault for which the safety mechanism restores the item to a fault-free state, the fault is a detected multiple-point fault even if the driver is never notified of its presence. For example, if memory is protected against transient faults by an error-correcting code and the safety mechanism not only supplies the CPU with the corrected value but also repairs the flipped bits within the memory array (e.g., by writing back the corrected value), the item is restored to a fault-free state.

2.1.1. Single point fault

A fault in a hardware element that is not covered by any safety mechanism and that directly leads to the violation of a safety goal. An example is an unmonitored resistor with at least one failure mode (e.g., open circuit) that can violate the safety goal.

2.1.2. Residual fault

A fault in a hardware element that can directly lead to the violation of a safety goal even though at least one safety mechanism addresses faults of that element, because the diagnostic coverage of the safety mechanism is less than 100 %. For example, if a RAM module is checked only by a checkerboard RAM test, certain kinds of bridging faults are not detected, and violations of the safety goal due to these faults are not covered by the safety mechanism. These uncovered faults are residual faults.

2.1.3. Detected two-point faults

Faults that are detected by a safety mechanism, which prevents them from remaining latent, and that can lead to the violation of a safety goal only in combination with another independent hardware fault (i.e., as part of a dual-point failure). For example, in a flash memory protected by parity, a single-bit fault is detected and, in accordance with the technical safety concept, triggers a reaction such as shutting down the system and informing the driver via a warning light.

2.1.4. Perceived two-point faults

Faults that may or may not be detected by a safety mechanism within a defined time interval, that are perceived by the driver, and that can lead to the violation of a safety goal only in combination with another independent hardware fault (i.e., as part of a dual-point failure). Examples are dual-point faults whose consequences clearly and unambiguously affect the function, so that the driver perceives them.

2.1.5. Latent two-point faults

Faults that are neither detected by a safety mechanism nor perceived by the driver: the system remains operational, and the driver is not made aware of the fault until a second independent hardware fault occurs.

For example, in ECC-protected flash memory, the ECC corrects a single-bit permanent fault during a read, but the content of the flash memory itself is not corrected and no signal indicates the correction. In this case, the fault cannot lead to a violation of the safety goal (because the faulty bit is corrected on read), but it is neither detected (because the single-bit fault is not signaled) nor perceived (because it has no impact on the functionality of the application). If an additional fault occurs in the ECC logic, control over single-bit faults can be lost, potentially leading to a violation of the safety goal.

2.1.6. Safe fault

Safe faults fall into two categories:

a) All n-point faults with n > 2, unless the safety concept identifies them as relevant to the violation of a safety goal; or

b) Faults that do not lead to the violation of a safety goal.

An example is a single-bit fault that is corrected by the ECC but not signaled, in a flash memory protected by both ECC and a cyclic redundancy check (CRC). The ECC prevents the fault from violating the safety goal, but does not signal it. If the ECC logic fails, the CRC can still detect the fault and the system is shut down. The safety goal is violated only if a single-bit fault exists in the flash memory, the ECC logic fails, and the CRC check and monitoring also fail (n = 3).

2.2. Failure modes and failure rates of hardware elements

2.2.1. Failure modes of hardware elements

According to the fault classification model, the failure modes of hardware elements are categorized as shown in Figure 10.

Figure 11. Flowchart for Failure Model Classification

2.2.2. Hardware Element Failure Mode Classification Process

The failure mode classification process is shown in Figure 11.

In Figure 11:

λSPF is the failure rate associated with single-point faults of the hardware element

λRF is the failure rate associated with residual faults of the hardware element

λMPF is the failure rate associated with multiple-point faults of the hardware element

λS is the failure rate associated with safe faults of the hardware element

The failure rate associated with multiple-point faults, λMPF, can be expressed according to Equation (1-1) as:

λMPF = λMPF,DP + λMPF,L    (1‑1)

where:

λMPF,DP is the failure rate associated with detected or perceived multiple-point faults of the hardware element

λMPF,L is the failure rate associated with latent multiple-point faults of the hardware element
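A reduced version of this classification and of the decomposition in Equation (1-1) can be sketched in code. The decision flags and coverage values below are simplifications of the flowchart in Figure 11 and are illustrative only, not a substitute for the classification in the standard.

```python
# Simplified sketch of the failure-rate decomposition behind Equation (1-1),
# loosely following the classification flow of Figure 11.

def split_failure_rate(lam, violates_sg_directly, contributes_in_combination,
                       dc_residual, dc_latent):
    """Split an element failure rate lam (in FIT) into contributions to
    lambda_SPF, lambda_RF, lambda_MPF,DP, lambda_MPF,L and lambda_S."""
    if violates_sg_directly:
        if dc_residual == 0.0:                  # no safety mechanism at all
            return {"SPF": lam, "RF": 0.0, "MPF_DP": 0.0, "MPF_L": 0.0, "S": 0.0}
        rf = lam * (1.0 - dc_residual)          # uncovered share -> residual
        covered = lam - rf                      # covered share -> multiple-point
        mpf_l = covered * (1.0 - dc_latent)     # not covered w.r.t. latent faults
        return {"SPF": 0.0, "RF": rf, "MPF_DP": covered - mpf_l, "MPF_L": mpf_l, "S": 0.0}
    if contributes_in_combination:              # dual-point (or relevant n-point) faults
        mpf_l = lam * (1.0 - dc_latent)
        return {"SPF": 0.0, "RF": 0.0, "MPF_DP": lam - mpf_l, "MPF_L": mpf_l, "S": 0.0}
    return {"SPF": 0.0, "RF": 0.0, "MPF_DP": 0.0, "MPF_L": 0.0, "S": lam}   # safe fault

parts = split_failure_rate(20.0, violates_sg_directly=True,
                           contributes_in_combination=True,
                           dc_residual=0.99, dc_latent=0.90)
print(parts, "lambda_MPF =", parts["MPF_DP"] + parts["MPF_L"])   # Equation (1-1)
```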

2.3. Hardware Architecture Metrics

Hardware architecture metrics are used to assess the effectiveness of the associated item architecture in coping with random hardware failures.

The goals of hardware architecture metrics are:

  • Be objectively evaluable: the metrics are verifiable and precise enough to distinguish between different architectures;
  • Support the evaluation of the final design (based on the detailed hardware design with accurate calculations);
  • Provide pass/fail criteria for hardware architectures based on the ASIL;
  • Indicate the adequacy of the coverage by safety mechanisms used to prevent the risk of single-point or residual faults in the hardware architecture (single-point fault metric);
  • Indicate the adequacy of the coverage by safety mechanisms used to protect against the risk of latent faults in the hardware architecture (latent fault metric);
  • Address single-point, residual, and latent faults;
  • Ensure the robustness of the hardware architecture;
  • Be limited to safety-related elements only;
  • Support applications at different element levels, such as assigning target values for vendor hardware elements; for example, target values can be assigned to microcontrollers or ECUs to facilitate distributed development.

2.3.1. Single-Point Fault Metric (SPFM)

The single-point fault metric reflects the robustness of the item to single-point and residual faults, either through the coverage provided by safety mechanisms or through design (primarily safe faults). A high single-point fault metric indicates that the proportion of single-point and residual faults in the hardware of the item is low.

For hardware designs implementing safety goals rated ASIL (B), C, or D, Equation (1-2) is used to determine the single-point fault metric.
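It can be written (summing over the safety-related hardware elements of the item; see ISO 26262-5 for the exact definition) as:

SPFM = 1 − Σ (λSPF + λRF) / Σ λ    (1‑2)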

Figure 12. Graphical Representation of the Single-Point Fault Metric - ISO 26262-5:2018(E)

Only the safety-related hardware elements of the item are considered. Hardware elements exhibiting only safe faults or n-point faults with n > 2 may be omitted from the calculation unless they are explicitly addressed in the technical safety concept. A graphical representation of the single-point fault metric is shown in Figure 12.

2.3.2. Latent Fault Metric (LFM)

The latent fault metric reflects the robustness of the item to latent faults, either through the coverage of faults by safety mechanisms, through the driver recognizing the presence of the fault before a safety goal is violated, or through design (primarily safe faults). A high latent fault metric implies a low proportion of latent faults in the hardware.

For hardware designs implementing safety goals rated ASIL (B), (C), or (D), Equation (1-3) is used to determine the latent fault metric.
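Analogously to the SPFM, it can be written (again summing over the safety-related hardware elements; see ISO 26262-5 for the exact definition) as:

LFM = 1 − Σ λMPF,L / Σ (λ − λSPF − λRF)    (1‑3)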

Figure 13. Graphical Representation of Latent Fault Metrics - ISO 26262-5:2018(E)

Only the safety-related hardware elements of the item are considered. Hardware elements exhibiting only safe faults or n-point faults with n > 2 are omitted from the calculation unless they are explicitly addressed in the technical safety concept. A graphical representation of the latent fault metric is shown in Figure 13.

Table 1. Hardware Architecture Design Metrics and Standard Requirements

2.3.3. Probabilistic Metric for Random Hardware Failures (PMHF)

As shown in Equation (1-4), the PMHF value is estimated as:

PMHFest = λSPF + λRF + λDPF,det × λDPF,latent × Tlifetime    (1‑4)

For each failure mode, calculate its contribution to the total PMHF value as a percentage.
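As a small numeric illustration of Equation (1-4), the snippet below uses invented failure rates (in FIT, i.e., 1e-9 per hour) and a hypothetical vehicle lifetime; the values carry no significance beyond showing the arithmetic.

```python
# Numeric illustration of the PMHF estimate in Equation (1-4).
# All rates and the lifetime are hypothetical example values.

FIT = 1e-9                     # 1 FIT = 1e-9 failures per hour

lam_spf        = 0.1 * FIT     # single-point faults
lam_rf         = 0.5 * FIT     # residual faults
lam_dpf_det    = 20.0 * FIT    # detected dual-point faults
lam_dpf_latent = 2.0 * FIT     # latent dual-point faults
t_lifetime_h   = 10_000.0      # assumed operating lifetime in hours

pmhf = lam_spf + lam_rf + lam_dpf_det * lam_dpf_latent * t_lifetime_h
print(f"PMHF ~ {pmhf:.3e} per hour ({pmhf / FIT:.3f} FIT)")

# Contribution of each term as a percentage of the total, as suggested above.
terms = {
    "single-point": lam_spf,
    "residual": lam_rf,
    "dual-point (detected x latent)": lam_dpf_det * lam_dpf_latent * t_lifetime_h,
}
for name, value in terms.items():
    print(f"{name}: {100.0 * value / pmhf:.1f} %")
```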

2.4 Hardware Architecture Metrics Target Values

For the hardware architectural design metrics, the standard provides corresponding target values (shown in Table 1), which depend on the highest ASIL that the hardware design has to meet. For ASIL A, the standard does not recommend target values; for ASIL D, it recommends the most stringent targets; and for some cases of ASIL B and ASIL C, the metrics are recommendations rather than mandatory requirements in the sense of the standard.
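The pass/fail comparison against such targets is easy to automate. In the sketch below, the target values are those commonly quoted from ISO 26262-5 for SPFM, LFM, and PMHF; they should be confirmed against Table 1 and the standard for a concrete project.

```python
# Sketch of a pass/fail check of hardware architectural metrics against targets.
# Target values as commonly quoted from ISO 26262-5; confirm against the standard.

TARGETS = {
    # ASIL: (minimum SPFM, minimum LFM, maximum PMHF in 1/h)
    "B": (0.90, 0.60, 1e-7),
    "C": (0.97, 0.80, 1e-7),
    "D": (0.99, 0.90, 1e-8),
}

def meets_targets(asil, spfm, lfm, pmhf):
    if asil not in TARGETS:        # e.g., ASIL A: no target values recommended
        return True
    min_spfm, min_lfm, max_pmhf = TARGETS[asil]
    return spfm >= min_spfm and lfm >= min_lfm and pmhf <= max_pmhf

print(meets_targets("D", spfm=0.992, lfm=0.93, pmhf=6.0e-9))   # True
print(meets_targets("C", spfm=0.95,  lfm=0.85, pmhf=5.0e-8))   # False: SPFM below 97 %
```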

Figure 14. Temporal Interference Leading to Cascading Failures - ISO 26262-6:2018(E)

Safety Analyses in the Software Design Phase

1. Safety-related functions and attributes

The derivation of software safety requirements should consider the safety-related functionality and safety-related attributes required by the software, where a failure of safety-related functions or a failure of safety-related attributes may result in a violation of the technical safety requirements assigned to the software.

Safety-related functions of software typically include:

  • Functions that enable safe execution of the nominal function
  • Functions that enable the system to achieve or maintain a safe or degraded state
  • Functions related to detecting, indicating, and mitigating failures of safety-relevant hardware elements
  • Self-testing or monitoring functions related to detecting, indicating, and mitigating the failure of the operating system, underlying software, or the application software itself
  • Functions related to on-board or off-board testing during production, operation, service, and end of life
  • Functions that allow modification of software during production and service
  • Functionality related to performance or time-critical operations

Safety-related attributes typically include:

  • Robustness to erroneous inputs
  • Independence or interference freedom between different functions
  • Fault tolerance of the software, etc.

2. Safety analysis in the software architecture design phase

During the software architecture design phase, we should perform the appropriate software safety analysis activities, which the standard calls safety-oriented analysis methods. Safety-oriented analysis is essentially qualitative. First, safety-oriented analysis helps provide evidence that the software is suitable to provide the specified safety-related function and attributes as required by the desired ASIL level. Secondly, safety-oriented analysis helps to identify or validate safety-related software content. Finally, safety-oriented analysis supports the development of safety mechanisms and thus verifies the effectiveness of safety measures.

Where freedom from interference or sufficient independence between safety-related elements is required, the standard recommends performing a dependent failure analysis (DFA). The DFA identifies the relevant failures, or potentially relevant failures, and their effects. The objective and scope of the DFA depend on the sub-phase and the level of abstraction at which the analysis is performed, and the elements and failures of interest are defined before the analysis is carried out (e.g., in the safety plan). The DFA is also a qualitative analysis.

3. Analysis of safety-related failures at the software architecture level

The embedded software has to provide the specified functions, behaviors, and attributes with the integrity required by the assigned ASIL. Safety-related failure analysis at the software architecture level is applied to check or confirm that the corresponding safety functions and attributes achieve the assigned ASIL requirements.

3.1 Purpose of safety analysis at the software architecture level

During the software architectural design phase, the relevant failure analysis shall be conducted to identify single events, faults, or failures that could cause multiple software elements required to be independent to fail (e.g., cascading and/or common-cause failures, including common-mode failures), as well as single events, faults, or failures that may initiate a causal chain leading to the violation of a safety requirement by propagating from one software element to another (e.g., cascading failures). Through this failure analysis at the software architectural design stage, the degree of independence or freedom from interference achieved between the related software architectural elements can then be examined.

3.2 Software Architecture Level Safety Analysis

Relevant failure analysis at the software architecture level shall consider the following:

  • Identifying possible design weaknesses, conditions, errors, or failures that could trigger a causal chain leading to violation of safety requirements (e.g., using inductive or deductive methods)
  • Analyzing the consequences of possible faults, failures, or causal chains on the functions and attributes required of the software architectural elements

3.3 Application Scenarios for Relevant Failure Analysis at the Software Architecture Level

The following scenarios may require failure analysis at the software architecture level:

  • Applying ASIL decomposition at the software level
  • Implementing software safety requirements, such as providing the evidence for the effectiveness of software safety mechanisms, where the independence between monitored elements and monitoring elements must be ensured

4. Deriving Safety Mechanisms from Software Architecture Level Safety Analysis

Safety measures include safety mechanisms derived from safety-oriented analyses and cover issues related to random hardware failures as well as software failures. Based on the results of the safety analyses performed at the software architecture level, error detection safety mechanisms and error handling safety mechanisms are implemented.

4.1 Error Detection Safety Mechanism

Error detection safety mechanisms include:

  • Range checks on input and output data
  • Plausibility checks (e.g., using reference models of the desired behavior, assertion checks, or comparing signals from different sources)
  • Data error detection (e.g., error detection codes and redundant data storage)
  • Monitoring of program execution by an external element (e.g., an ASIC) or by another software element that performs a watchdog function; the monitoring can be logical, temporal, or both
  • Temporal monitoring of program execution
  • Diverse redundancy design
  • Access violation control mechanisms implemented in software or hardware, used to allow or deny access to safety-related shared resources

The example in Figure 14 illustrates the interference caused by conflicting use of a shared resource (e.g., a shared processing element): a QM software element interferes with and delays the timely execution of an ASIL software element (such interference can also occur between software elements with different ASILs). The upper half of the figure shows the software execution without a mechanism to detect this interference. By introducing "checkpoints" into the software and monitoring them against timeouts, timing disturbances can be detected and appropriate countermeasures taken; a minimal sketch of such checkpoint monitoring follows.
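The sketch below illustrates checkpoint-based temporal monitoring; the checkpoint names, deadlines, and the simple elapsed-time scheme are illustrative assumptions rather than a specific production mechanism.

```python
# Illustrative sketch of checkpoint-based temporal program-flow monitoring.
# Checkpoint names and deadlines are hypothetical example values.

import time

class CheckpointMonitor:
    def __init__(self, deadlines):
        # deadlines: maximum allowed seconds between consecutive checkpoints
        self.deadlines = deadlines
        self.last_time = time.monotonic()

    def reached(self, checkpoint):
        now = time.monotonic()
        elapsed = now - self.last_time
        self.last_time = now
        if elapsed > self.deadlines[checkpoint]:
            # Timing violation detected: trigger a reaction, e.g. transition to a
            # safe or degraded state (here we only report the violation).
            print(f"timeout before {checkpoint}: {elapsed * 1000:.1f} ms")
            return False
        return True

monitor = CheckpointMonitor({"CP_read_inputs": 0.010,
                             "CP_compute": 0.020,
                             "CP_write_outputs": 0.010})

# Simulated execution of the monitored ASIL software element:
monitor.reached("CP_read_inputs")
time.sleep(0.03)                   # interference from a QM element delays execution
monitor.reached("CP_compute")      # 20 ms deadline exceeded -> interference detected
monitor.reached("CP_write_outputs")
```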

4.2 Safety mechanisms for error handling

Safety mechanisms for error handling include:

  • Deactivation to achieve and maintain a safe state
  • Static recovery mechanisms (e.g., module recovery, backward recovery, forward recovery, and repeated recovery)
  • Graceful degradation by prioritizing functions to minimize the negative impact of potential failures on functional safety
  • Homogeneous redundancy in design, which focuses primarily on controlling the effects of transient or random failures in hardware executing similar software (e.g., temporary redundant execution of software)
  • Diverse redundancy in design, which involves designing different software in each parallel path and focuses mainly on preventing or controlling systematic failures in the software
  • Error-correcting codes for data
  • Access rights management implemented in software or hardware to grant or deny access to safety-related shared resources

It is important to note that a review of system-level software safety mechanisms (including robustness mechanisms) can be performed to analyze their potential influence on system behavior and their alignment with technical safety requirements.

References

  1. International Organization for Standardization. (2018). Road vehicles — Functional safety (ISO 26262:2018). https://www.iso.org/standard