BEC: Bit-Level Static Analysis for Reliability against Soft Errors

Yousun Ko, Bernd Burgstaller

TL;DR

This work addresses the reliability challenge posed by soft errors in digital systems by introducing BEC, a static, bit-level analysis that tracks every fault site in the space $F = P \times V$ (program points $P$ times data points $V$) and classifies the effects of bit flips at compile time. It combines a forward Global Abstract Bit Value Analysis with a backward Bit-level Fault Index Coalescing Analysis to map bit corruptions to program semantics and to merge fault sites into equivalence classes. These equivalence classes enable pruning of fault-injection (FI) campaigns and reliability-aware instruction scheduling. Implemented in LLVM 16.0.0 for the RISC-V architecture and evaluated on eight benchmarks, BEC prunes up to $30.04\%$ of fault-injection runs (average $13.71\%$) and improves reliability by up to $13.11\%$ (average $4.94\%$); the scheduling does so without increasing the instruction count or the number of FI runs. The results demonstrate the practical value of bit-level static analysis for software-level reliability and motivate extensions to hardware synthesis and broader architecture coverage.

Abstract

Soft errors are a type of transient digital signal corruption that occurs in digital hardware components such as the internal flip-flops of CPU pipelines, the register file, memory cells, and even internal communication buses. Soft errors are caused by environmental radioactivity, magnetic interference, lasers, and temperature fluctuations, either unintentionally or as part of a deliberate attempt to compromise a system and expose confidential data. We propose a bit-level error coalescing (BEC) static program analysis and its two use cases to understand and improve program reliability against soft errors. The BEC analysis tracks each bit corruption in the register file and classifies the effect of the corruption by its semantics at compile time. The usefulness of the proposed analysis is demonstrated in two scenarios: fault injection campaign pruning and reliability-aware program transformation. Experimental results show that bit-level analysis pruned up to 30.04% of exhaustive fault injection campaigns (13.71% on average), without loss of accuracy. Program vulnerability was reduced by up to 13.11% (4.94% on average) through bit-level vulnerability-aware instruction scheduling. The analysis has been implemented within LLVM and evaluated on the RISC-V architecture. To the best of our knowledge, the proposed BEC analysis is the first bit-level compiler analysis for program reliability against soft errors. The proposed method is generic and not limited to a specific computer architecture.
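
To make the idea of classifying bit corruptions by their semantic effect concrete, the following is a minimal sketch of an exhaustive single-bit fault-injection experiment on one register value. It is an illustration only, not the authors' BEC implementation: the predicate is modeled on the paper's motivating example (years that are even but not a multiple of four), and the 32-bit register width and all function names are assumptions made for this sketch. A static bit-level analysis aims to reach the same classification at compile time, without running the experiments.

```python
# Toy single-bit fault injection on one register value.
# All names and the 32-bit width are assumptions for illustration.

WIDTH = 32  # assumed register width


def is_even_not_multiple_of_four(year: int) -> bool:
    """True if year is even but not divisible by four.

    Only the two least significant bits influence the result.
    """
    return (year & 0b11) == 0b10


def flip_bit(value: int, bit: int) -> int:
    """Model a soft error: invert one bit of a WIDTH-bit register value."""
    return (value ^ (1 << bit)) & ((1 << WIDTH) - 1)


def classify_faults(year: int) -> dict[int, str]:
    """Run one fault-injection experiment per bit and label its effect."""
    golden = is_even_not_multiple_of_four(year)  # fault-free reference result
    effects = {}
    for bit in range(WIDTH):
        faulty = is_even_not_multiple_of_four(flip_bit(year, bit))
        effects[bit] = "masked" if faulty == golden else "corrupting"
    return effects


effects = classify_faults(2022)
masked = [bit for bit, effect in effects.items() if effect == "masked"]
print(f"{len(masked)} of {WIDTH} single-bit flips are masked")
# prints "30 of 32 single-bit flips are masked"
```

In this toy setting, the 30 upper bits all behave identically under corruption, so a single representative experiment (or none, if masking can be proven statically) would suffice for them. This is the intuition behind pruning a fault-injection campaign without loss of accuracy.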

Paper Structure (19 sections, 4 figures, 4 tables, 4 algorithms)

This paper contains 19 sections, 4 figures, 4 tables, 4 algorithms.

Figures (4)

  • Figure 1: Motivating example to count the number of years that are even but not a multiple of four, inspired by the concept of leap year.
  • Figure 2: (a) CFG and (b) fault sites of the motivating example from Fig. 1. In the fault-site diagrams, the x-axis represents data points (variables). The y-axis represents program points (instructions), which are labeled by their corresponding instructions from the CFG (labels $p_0$--$p_{10}$). Live fault sites are data and program points that contain live values, depicted by white boxes with known bit values. Boxes are colored if the fault sites are identified as subjects for vulnerability tests by value-level analysis (red) or bit-level analysis (orange). The right half of the figure (c,d) depicts the motivating example after instruction rescheduling to minimize the live fault sites in bits. Note that the number of instructions to be executed and the number of fault injection runs required remain unchanged after bit-level vulnerability-aware instruction scheduling, while the number of live fault sites is reduced by 15.4%.
  • Figure 3: Bit-level analysis: (a) lattice representation of bit values, (b) meet operator, and (c) bit-wise $\mathrm{and}$ operator.
  • Figure 4: Iterative fault index coalescing of a fork-after-join CFG snippet using 4-bit data points: (a) initial fault indices are assigned to bits of data points, (b) fault indices are coalesced within their instruction during the intra-instruction fault index coalescing phase, and (c) fault indices are coalesced across instructions during the inter-instruction fault index coalescing phase. Note that coalescing is a monotonic process that is performed backward along the dependency edges. The example code is in SSA form for brevity, but the proposed method is not limited to SSA form.
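
The ingredients named in the caption of Figure 3 (a lattice of abstract bit values, a meet operator, and an abstract bit-wise and) can be sketched as follows. The figure itself is not reproduced here, so this sketch assumes the conventional four-element constant-propagation lattice (top, 0, 1, bottom); the paper's exact lattice and operator tables may differ. The names `Bit`, `meet`, `bit_and`, and the word helpers are hypothetical.

```python
from enum import Enum


class Bit(Enum):
    """Abstract value of a single bit (assumed four-element lattice)."""
    TOP = "T"     # no information yet (optimistic initial value)
    ZERO = "0"    # bit is known to be 0
    ONE = "1"     # bit is known to be 1
    BOTTOM = "X"  # bit may be 0 or 1 at run time


def meet(a: Bit, b: Bit) -> Bit:
    """Combine the facts arriving from two control-flow paths."""
    if a is Bit.TOP:
        return b
    if b is Bit.TOP or a is b:
        return a
    return Bit.BOTTOM  # conflicting or unknown facts


def bit_and(a: Bit, b: Bit) -> Bit:
    """Abstract transfer function for a one-bit `and` (assumed semantics)."""
    if Bit.ZERO in (a, b):
        return Bit.ZERO    # a known 0 forces the result to 0
    if Bit.TOP in (a, b):
        return Bit.TOP     # still waiting for information
    if Bit.BOTTOM in (a, b):
        return Bit.BOTTOM  # result depends on an unknown bit
    return Bit.ONE         # both operands are known to be 1


def word(text: str) -> list[Bit]:
    """Parse an abstract word written MSB first, e.g. "00XX"."""
    return [Bit(ch) for ch in text]


def show(bits: list[Bit]) -> str:
    """Render an abstract word MSB first."""
    return "".join(bit.value for bit in bits)


def word_and(xs: list[Bit], ys: list[Bit]) -> list[Bit]:
    return [bit_and(x, y) for x, y in zip(xs, ys)]


def word_meet(xs: list[Bit], ys: list[Bit]) -> list[Bit]:
    return [meet(x, y) for x, y in zip(xs, ys)]


# An unknown 4-bit value and-ed with the constant 0b0011:
print(show(word_and(word("XXXX"), word("0011"))))   # prints "00XX"
# Two paths that disagree on the two low bits:
print(show(word_meet(word("0101"), word("0110"))))  # prints "01XX"
```

In the first example, the two upper result bits are known to be 0 regardless of the unknown operand. Under the assumptions of this sketch, a bit flip in those positions of the unknown operand cannot change the result of this particular and, which is the kind of fact a bit-level analysis can exploit when deciding which fault sites need testing.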