Zero-Space Cost Fault Tolerance for Transformer-based Language Models on ReRAM

Bingbing Li; Geng Yuan; Zigeng Wang; Shaoyi Huang; Hongwu Peng; Payman Behnam; Wujie Wen; Hang Liu; Caiwen Ding

TL;DR

This work tackles the problem of deploying Transformer-based language models on ReRAM where hard faults can severely distort weights during inference. It proposes a three-stage approach: (1) differentiable structured pruning to automatically create row/column sparsity and free space for backups, (2) MSB duplication with voting to tolerate stuck-at faults on the most impactful bits, and (3) embedding of duplicated MSBs into the pruned weight structure to avoid extra storage. Empirically, the method achieves over 30% sparsity with negligible accuracy loss across nine GLUE tasks using BERT-base, while maintaining zero storage overhead due to embedding, and demonstrates robustness to SAF defects via voting that tolerates up to ~2.5× higher fault rates. The results suggest practical, scalable fault-tolerant inference for ReRAM-based NLP accelerators, enabling robust performance with minimal hardware overhead.
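As a rough illustration of how structured pruning frees whole columns that could later hold backup bits, the sketch below zeroes the columns with the smallest L2 norm. This is a plain magnitude-based stand-in, not the differentiable pruning the paper proposes; the function name `prune_columns`, the toy matrix, and the 50% sparsity are assumptions chosen only for the example.

```python
import math

def prune_columns(matrix, sparsity):
    """Zero the lowest-L2-norm columns; return (pruned matrix, pruned column indices)."""
    n_cols = len(matrix[0])
    # L2 norm of every column (hypothetical importance score for this sketch).
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    n_prune = int(n_cols * sparsity)
    # Indices of the n_prune weakest columns, reported in ascending order.
    pruned = sorted(sorted(range(n_cols), key=lambda j: norms[j])[:n_prune])
    kept = [[0.0 if j in pruned else v for j, v in enumerate(row)] for row in matrix]
    return kept, pruned

# Toy 2x4 weight matrix; columns 1 and 3 carry almost no magnitude.
W = [
    [0.9, 0.01, -0.5, 0.02],
    [-0.7, 0.03, 0.4, -0.01],
]
W_pruned, freed = prune_columns(W, sparsity=0.5)
print(freed)  # indices of the columns that were zeroed
```

Pruning entire columns, rather than scattered individual weights, is what leaves contiguous crossbar cells free for reuse.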

Abstract

Resistive Random Access Memory (ReRAM) has emerged as a promising platform for deep neural networks (DNNs) due to its support for parallel in-situ matrix-vector multiplication. However, hardware failures, such as stuck-at-fault defects, can result in significant prediction errors during model inference. While additional crossbars can be used to address these failures, they come with storage overhead and are not efficient in terms of space, energy, and cost. In this paper, we propose a fault protection mechanism that incurs zero space cost. Our approach includes: 1) differentiable structured pruning of rows and columns to reduce model redundancy, 2) weight duplication and voting for robust output, and 3) embedding duplicated most significant bits (MSBs) into the model weight. We evaluate our method on nine tasks of the GLUE benchmark with the BERT model, and experimental results demonstrate its effectiveness.
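The duplication-and-voting idea in step 2 can be sketched as follows. This is a minimal software model, assuming an unsigned 8-bit weight, three stored copies of the MSB, and a simple majority vote; the helper names (`stuck_at`, `majority`, `read_weight`) and the fault labels `"sa0"` and `"sa1"` are hypothetical and do not come from the paper, which performs voting on crossbar outputs.

```python
def stuck_at(bit, fault):
    """Apply a stuck-at fault: 'sa0' forces 0, 'sa1' forces 1, None leaves the bit intact."""
    if fault == "sa0":
        return 0
    if fault == "sa1":
        return 1
    return bit

def majority(bits):
    """Return the majority value of an odd-length list of bits."""
    return 1 if sum(bits) * 2 > len(bits) else 0

def read_weight(weight, msb_faults, n_bits=8):
    """Read a weight whose MSB is stored in len(msb_faults) cells, one fault entry per cell."""
    msb = (weight >> (n_bits - 1)) & 1
    copies = [stuck_at(msb, f) for f in msb_faults]   # each copy may be corrupted
    voted = majority(copies)                          # vote across the MSB copies
    low = weight & ((1 << (n_bits - 1)) - 1)          # lower bits assumed fault-free here
    return (voted << (n_bits - 1)) | low

w = 0b10110101                                        # 181
unprotected = read_weight(w, ["sa0"])                 # single MSB cell, stuck at 0
protected = read_weight(w, ["sa0", None, None])       # three copies, one faulty
print(unprotected, protected)                         # 53 181
```

A single stuck-at-0 fault on the unprotected MSB shifts the value from 181 to 53, while the voted read recovers 181. In this model, voting fails once a majority of the copies are faulty in the same direction.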

Paper Structure (16 sections, 7 equations, 7 figures, 3 tables, 2 algorithms)

Introduction
Background and limitation
ReRAM-based systems: Advantage and Limitations
Pruning for efficient ReRAM utilization
Fault-tolerated ReRAM
Stage 1: Differentiable structured pruning
Stage 2: Fault tolerance
Quantization and binarization
MSB duplication and result voting
Stage 3: Embedding MSB candidates for weight-crossbar mapping
Evaluation
Experiment settings
Results of differentiable structured pruning
Distribution of weight parameters
MSB candidates voting
...and 1 more section

Figures (7)

Figure 1: Zero-space Cost Fault Tolerance on ReRAM.
Figure 2: Differentiable structured (e.g. column) pruning framework.
Figure 3: Weight distribution of the 1st layer weight matrix of BERT model: most values of the parameters are much smaller than the half of the weight maximum.
Figure 4: Duplication and output voting on ReRAM crossbars ($O_E$ is the expected output, $O_{Non}$ is the output without voting, $O_V$ is the output after voting, OPC module is the Output Peripheral Component).
Figure 5: MSB embedding to eliminate storage overhead.
...and 2 more figures
