ReaLM: Reliable and Efficient Large Language Model Inference with Statistical Algorithm-Based Fault Tolerance

Tong Xie, Jiawang Zhao, Zishen Wan, Zuodong Zhang, Yuan Wang, Runsheng Wang, Ru Huang, Meng Li

TL;DR

ReaLM tackles the reliability-efficiency gap in LLM inference on accelerators by first mapping the fault resilience of LLMs through a large-scale error-injection study, then introducing a statistical ABFT that adaptively protects network components with a low-cost error-detection circuit. By leveraging the inherent resilience variations in LLMs, ReaLM enables near-threshold voltage operation and significantly reduces recovery costs while preserving model performance, achieving up to $35.83\%$ energy savings and reducing perplexity degradation from $18.54$ to $0.29$. The approach integrates with systolic-array GEMM processing, supporting both weight-stationary and output-stationary dataflows, and incurs small hardware overhead ($1.42\%$ circuit area and $1.79\%$ power). The work highlights the critical role of normalization-driven vulnerabilities in LLMs and provides a practical co-design framework for reliable and efficient LLM inference.

Abstract

The demand for efficient large language model (LLM) inference has propelled the development of dedicated accelerators. As accelerators are vulnerable to hardware faults due to aging, variation, etc., existing accelerator designs often reserve a large voltage margin or leverage algorithm-based fault tolerance (ABFT) techniques to ensure LLM inference correctness. However, previous methods often overlook the inherent fault tolerance of LLMs, leading to high computation and energy overhead. To enable reliable yet efficient LLM inference, we propose a novel algorithm/circuit co-design framework, dubbed ReaLM. For the first time, we systematically characterize the fault tolerance of LLMs by performing a large-scale error injection study of representative LLMs and natural language understanding tasks. Then, we propose a statistical ABFT algorithm that fully leverages this error robustness to minimize error recovery. We also customize the error detection circuits to enable low-cost online collection of error statistics. Extensive experiments show that with only 1.42% circuit area and 1.79% power overhead, ReaLM can reduce perplexity degradation from 18.54 to 0.29. Compared to existing methods, ReaLM consistently reduces recovery costs across different operating voltages and improves energy efficiency by up to 35.83% without compromising LLM performance. Our error injection code is available at https://github.com/PKU-SEC-Lab/ReaLM_DAC25/
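
To make the ABFT idea in the abstract concrete, the following is a minimal sketch of classic checksum-based error detection for a matrix multiply, extended with a toy "recover only when the error is large" policy. It is an illustration under stated assumptions, not the paper's algorithm or circuit: the function names (`column_checksum_diffs`, `error_statistics`, `needs_recovery`) and the `max_magnitude` budget are hypothetical, and the single-threshold policy merely stands in for the statistical criterion the authors describe.

```python
def matmul(a, b):
    """Plain integer matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_checksum_diffs(a, b, c):
    """Compare the column sums of C against (1^T A) B, computed independently.

    Returns one difference per output column; all zeros means no error detected.
    """
    ones_a = [sum(col) for col in zip(*a)]  # 1^T A
    expected = [sum(s * y for s, y in zip(ones_a, col)) for col in zip(*b)]
    observed = [sum(col) for col in zip(*c)]  # 1^T C
    return [o - e for o, e in zip(observed, expected)]


def error_statistics(diffs):
    """Summarize mismatches as (number of mismatching columns, largest magnitude)."""
    nonzero = [abs(d) for d in diffs if d != 0]
    return len(nonzero), max(nonzero, default=0)


def needs_recovery(diffs, max_magnitude):
    """Hypothetical policy: recompute only if a mismatch exceeds a magnitude budget."""
    _, magnitude = error_statistics(diffs)
    return magnitude > max_magnitude


a = [[1, 2], [3, 4], [5, 6]]
b = [[1, 0, 2], [0, 1, 3]]
c = matmul(a, b)
clean = column_checksum_diffs(a, b, c)

# Emulate a hardware fault: flip bit 6 of one accumulated output value.
faulty = [row[:] for row in c]
faulty[1][2] ^= 1 << 6
flipped = column_checksum_diffs(a, b, faulty)

print(clean, flipped, error_statistics(flipped))
```

A conventional ABFT scheme would trigger recomputation on any nonzero checksum difference. In this sketch, `needs_recovery(flipped, 16)` requests recovery while `needs_recovery(flipped, 128)` tolerates the same fault, which is the kind of trade-off a resilience-aware policy can exploit.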

Paper Structure

This paper contains 28 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: (a) Lower operating voltages increase BER, leading to significant perplexity degradation without protection. (b) Leveraging model resilience can reduce recovery costs. BERs are synthesized on an SA with commercial 14nm PDK (nominal voltage: 0.9V), aligning with prior studies [ernst2003razor, zhang2023read, wan2024mulberry]. Perplexity is evaluated using OPT-1.3B on the WikiText-2 dataset.
  • Figure 2: Transformer blocks of (a) OPT and (b) LLaMA.
  • Figure 3: (a) Principle of ABFT. The checksums are compared to detect errors and capture error statistics. (b) Implementation of ABFT on SA [bal2023novel].
  • Figure 4: Q1.1: (a)(b) Layer-wise resilience of different LLMs on different tasks. Q1.2: Bit-wise error resilience. (c) Error injection on K. (d) Error injection on O. Q1.3: (e)(f) Sensitivity to errors in different LLM components. Q1.4: Relationship between error frequency and magnitude. (g) Resilient components like K. (h) Sensitive components like O. Given MSD, the error magnitude decreases as the error frequency increases. Q2.1: (i)(j) Comparison between the prefill stage and decode stage. Q2.2: (k)(l) Impact of error injection across network components: O and Down remain highly sensitive. (a)(c)(e)(g)(h) are evaluated with OPT-1.3B on LAMBADA; (b)(d)(f) with LLaMA-2-7B on WikiText-2; (i)(k) with LLaMA-2-7B on X-Sum; (j)(l) with LLaMA-2-7B on GSM8K.
  • Figure 5: (a) Data distribution of the pre-norm layer in LLMs, where outliers dominate $\mu$ and $\sigma$. Injecting larger errors can cause significant skew. (b) Data distribution after normalization is largely affected by the injected error.
  • ...and 5 more figures
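
The normalization effect described in the Figure 5 caption can be illustrated with a small numerical sketch. The activation values, the error sizes, and the helper names (`layer_norm`, `max_shift_on_untouched`) are assumptions chosen for illustration and are not taken from the paper; the normalization here omits learned scale and bias.

```python
import math


def layer_norm(xs, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


def max_shift_on_untouched(reference, perturbed):
    """Largest change among normalized outputs whose inputs were NOT corrupted.

    Element 0 is the corrupted input in this sketch, so it is skipped.
    """
    return max(abs(r - p) for r, p in zip(reference[1:], perturbed[1:]))


activations = [0.5, -0.3, 0.1, 0.2, -0.4, 0.3, -0.1, 0.0]
reference = layer_norm(activations)

# Small error: roughly a low-order bit flip on one pre-norm value.
small = activations[:]
small[0] += 0.01

# Large error: roughly a high-order bit flip on the same value.
large = activations[:]
large[0] += 64.0

small_shift = max_shift_on_untouched(reference, layer_norm(small))
large_shift = max_shift_on_untouched(reference, layer_norm(large))
print(small_shift, large_shift)
```

Because the mean and standard deviation are computed over the whole vector, a single large error shifts every normalized output, including those whose inputs were correct, whereas a small error leaves them nearly unchanged. This is one plausible reading of why components feeding a normalization layer are reported as more sensitive.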