LM-Fix: Lightweight Bit-Flip Detection and Rapid Recovery Framework for Language Models

Ahmad Tahmasivand, Noureldin Zahran, Saba Al-Sayouri, Mohammed Fouda, Khaled N. Khasawneh

TL;DR

LM-Fix addresses the reliability gap in large language models by providing a lightweight, architecture-aware integrity layer that detects and recovers from bit-flip faults in weights. It introduces a model-native Hooked Tensor Auditing detector that compares a fixed test-vector output to a reference to identify corruptions quickly, without relying on semantic task results. The recovery pipeline localizes faults through cache clearing, layer-wise and parameter-level localization, and reconstructs corrupted parameters with a compact redundancy scheme using integer-weight representations, avoiding full model reloads. Across multiple models and fault scenarios, LM-Fix achieves high detection coverage and substantial speedups in recovery, enabling reliable, low-latency inference in edge and data-center deployments.
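To make the localize-then-reconstruct idea concrete, here is a minimal sketch of one possible redundancy scheme over integer weights: stored row and column sums. It is an illustrative assumption, not the scheme used by LM-Fix, which this summary does not specify. The names `checksums` and `locate_and_repair` are hypothetical, and the sketch handles only a single corrupted entry in a single matrix.

```python
def checksums(matrix):
    """Compute per-row and per-column integer sums (the stored redundancy)."""
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    return row_sums, col_sums


def locate_and_repair(matrix, ref_rows, ref_cols):
    """Find one corrupted entry via mismatched sums and fix it in place.

    Returns the (row, col) of the repaired entry, or None if nothing changed.
    Assumption: at most one entry is corrupted.
    """
    rows, cols = checksums(matrix)
    bad_rows = [i for i, (a, b) in enumerate(zip(rows, ref_rows)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(cols, ref_cols)) if a != b]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("sketch handles exactly one corrupted entry")
    i, j = bad_rows[0], bad_cols[0]
    # The difference between the reference and current row sum is the error.
    matrix[i][j] += ref_rows[i] - rows[i]
    return (i, j)


# Hypothetical integer weights and their reference checksums.
original = [[12, -7, 3], [0, 25, -14], [8, -1, 6]]
ref_rows, ref_cols = checksums(original)

# Simulate a single bit flip (bit 5) in entry (1, 1): 25 becomes 57.
stored = [row[:] for row in original]
stored[1][1] ^= 1 << 5

located = locate_and_repair(stored, ref_rows, ref_cols)
```

In this toy setup the repair touches one entry and needs only the small checksum vectors, which conveys why a local fix can be far cheaper than reloading every parameter.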

Abstract

This paper presents LM-Fix, a lightweight detection and rapid recovery framework for faults in large language models (LLMs). Existing integrity approaches are often heavy or slow for modern LLMs. LM-Fix runs a short test-vector pass and uses hash-guided checks to detect bit-flip faults, then repairs them locally without a full reload. Across multiple models, it detects over 94% of single-bit flips at TVL=200 and nearly 100% of multi-bit flips with approximately 1% to 7.7% runtime overhead; recovery is more than 100x faster than reloading. These results show a practical, low-overhead solution to keep LLMs reliable in production.
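The test-vector check described above can be sketched on a toy model. This is a hedged illustration under stated assumptions, not the paper's Hooked Tensor Auditing implementation: the "model" is a single integer matrix-vector product, and `forward`, `output_digest`, and `flip_bit` are hypothetical helper names. Only `hashlib.sha256` and `struct.pack` are real library calls.

```python
import hashlib
import struct


def forward(weights, x):
    """Toy one-layer integer model: y = W @ x, computed with Python ints."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def output_digest(weights, test_vector):
    """Hash the output that the fixed test vector produces."""
    y = forward(weights, test_vector)
    packed = struct.pack(f"<{len(y)}q", *y)  # little-endian signed 64-bit
    return hashlib.sha256(packed).hexdigest()


def flip_bit(weights, row, col, bit):
    """Return a copy of the weights with one bit flipped in one entry."""
    corrupted = [list(r) for r in weights]
    corrupted[row][col] ^= 1 << bit
    return corrupted


# Hypothetical weights and test vector, chosen only for illustration.
weights = [[3, -1, 4], [1, 5, -9], [2, 6, 5]]
test_vector = [1, 2, 3]

# Reference digest recorded once, while the weights are known to be good.
reference = output_digest(weights, test_vector)

# Later: a single bit flip in entry (1, 2) changes the test-vector output.
faulty = flip_bit(weights, 1, 2, 4)
is_corrupted = output_digest(faulty, test_vector) != reference
```

The check compares digests only, so it needs no labeled data or task-level evaluation. In this toy setup a flip is visible only when it changes the test-vector output, which gives one intuition for why detection coverage can fall short of 100%.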

Paper Structure

This paper contains 20 sections, 6 equations, 7 figures, 2 tables, 2 algorithms.

Figures (7)

  • Figure 1: LM-Fix Framework Overview
  • Figure 2: Column Search
  • Figure 3: Row Search
  • Figure 4: Detection improves with multiple bit-flips
  • Figure 5: Silent Safe Bit-Flips location in parameters
  • ...and 2 more figures