RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs

Youngcheon You, Banseok Lee, Minseop Choi, Seonyoung Kim, Hyochan Chong, Changdong Kim, Youngmin Kim, Dongkyu Kim

TL;DR

RaBiT tackles the accuracy-efficiency gap in extreme 2-bit LLM quantization by identifying inter-path adaptation in residual binarization under QAT and resolving it with a coupled training strategy. It enforces a residual hierarchy by deriving each binary path from a single shared full-precision weight, promoting negative correlation between path outputs and stabilizing training through function-aware initialization. Empirically, RaBiT achieves state-of-the-art 2-bit performance across Llama2-7B, Llama3-8B, and Gemma3, while delivering up to $4.49\times$ end-to-end speed-up on RTX 4090 and halving optimizer memory footprint. This work narrows the long-standing gap between binary efficiency and hardware-intensive Vector Quantization, enabling practical, high-performance quantized LLMs on consumer hardware with strong zero-shot and reasoning capabilities.

Abstract

Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical trade-off between low-bit efficiency and performance. Residual binarization enables hardware-friendly, matmul-free inference by stacking binary ($\pm 1$) layers, but is plagued by pathological feature co-adaptation. We identify a key failure mode, which we term inter-path adaptation: during quantization-aware training (QAT), parallel residual binary paths learn redundant features, degrading the error-compensation structure and limiting the expressive capacity of the model. While prior work relies on heuristic workarounds (e.g., path freezing) that constrain the solution space, we propose RaBiT, a novel quantization framework that resolves co-adaptation by algorithmically enforcing a residual hierarchy. Its core mechanism sequentially derives each binary path from a single shared full-precision weight, which ensures that every path corrects the error of the preceding one. This process is stabilized by a robust initialization that prioritizes functional preservation over mere weight approximation. RaBiT redefines the 2-bit accuracy-efficiency frontier: it achieves state-of-the-art performance, rivals even hardware-intensive Vector Quantization (VQ) methods, and delivers a $4.49\times$ inference speed-up over full-precision models on an RTX 4090.
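
The sequential derivation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`residual_binarize`, `reconstruct`), the single per-row scale (the paper's Figure 1 mentions learnable scales $\mathbf{g}_i, \mathbf{h}_i$), and the random toy weight are all assumptions made for illustration. The sketch only shows the residual hierarchy itself: each path binarizes what the previous paths left unexplained.

```python
import numpy as np

def residual_binarize(w_fp, num_paths=2):
    """Greedily derive binary paths from one shared weight matrix (toy sketch)."""
    residual = w_fp.copy()
    paths = []
    for _ in range(num_paths):
        # Binary component: strictly +1 or -1 (zeros are mapped to +1).
        sign = np.where(residual >= 0, 1.0, -1.0)
        # Assumed per-row scale: the mean absolute value of the current residual.
        scale = np.abs(residual).mean(axis=1, keepdims=True)
        paths.append((scale, sign))
        # The next path only sees the error left by the paths so far.
        residual = residual - scale * sign
    return paths

def reconstruct(paths, upto):
    """Sum the first `upto` scaled binary paths."""
    return sum(scale * sign for scale, sign in paths[:upto])

rng = np.random.default_rng(0)
w_fp = rng.standard_normal((64, 256))  # hypothetical stand-in for W_FP

paths = residual_binarize(w_fp, num_paths=2)
err_one = np.linalg.norm(w_fp - reconstruct(paths, 1))
err_two = np.linalg.norm(w_fp - reconstruct(paths, 2))
print(f"error with 1 path: {err_one:.3f}, with 2 paths: {err_two:.3f}")
```

Because the second path is computed from the residual of the first, adding it cannot increase the reconstruction error in this toy setting. The training procedure, learnable scales, and function-aware initialization described in the paper are not modeled here.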

Paper Structure

This paper contains 52 sections, 4 theorems, 17 equations, 6 figures, 11 tables, 2 algorithms.

Key Result

Proposition 1

In a Standard QAT scheme where two paths ($\hat{\mathbf{W}}_1, \hat{\mathbf{W}}_2$) are updated from their respective latent weights ($\mathbf{W}_1, \mathbf{W}_2$) using a shared global gradient $\mathbf{G} = \nabla_{\hat{\mathbf{W}}_1+\hat{\mathbf{W}}_2}\mathcal{L}$, the paths have a persistent ten
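
The shared-gradient premise of this setup can be written out as a short worked step. This is a sketch under stated assumptions, not the paper's derivation: it assumes a straight-through estimator that passes gradients through binarization unchanged, plain SGD with a learning rate $\eta$ (a symbol not in the original), and it ignores scale factors. The proposition's own statement is truncated above, so the sketch covers only the premise.

$$
\frac{\partial \mathcal{L}}{\partial \hat{\mathbf{W}}_1}
= \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{W}}_2}
= \mathbf{G}
\quad\Longrightarrow\quad
\mathbf{W}_i \leftarrow \mathbf{W}_i - \eta\,\mathbf{G}, \qquad i \in \{1, 2\}.
$$

Under these assumptions both latent weights receive the same update at every step, so nothing in the update rule distinguishes the role of one path from the other.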

Figures (6)

  • Figure 1: Overview of the RaBiT training framework. (a) Dynamic Compute Process: During training, binary paths are dynamically derived from a shared weight $\mathbf{W}_{\mathrm{FP}}$ to enforce a residual hierarchy. (b) Forward Pass: For inference, these paths execute in parallel for matmul-free efficiency. (c) Backward Pass: Gradients from the loss $\mathcal{L}$ update both the learnable scales ($\mathbf{g}_i, \mathbf{h}_i$) and the shared $\mathbf{W}_{\mathrm{FP}}$.
  • Figure 2: Visualization of Coupled Training Dynamics. (a) Inter-Path Correlation: RaBiT enforces a negative inter-path correlation, indicating effective error-correction, whereas Standard QAT leads to positive correlation (co-adaptation). (b) Training Loss: This structural advantage directly translates to a lower and more stable training loss for RaBiT, demonstrating its superior optimization path.
  • Figure 3: Visual analysis of weight initialization for the first layer's q_proj matrix of the Llama2-7B model. The top row displays the original full-precision weight ($\mathbf{W}_{\mathrm{FP}}$) alongside its initial approximations from three methods: (b) Greedy SVID, (c) Iterative SVID, and (d) Iterative SVID with I/O Channel Importance Scaling. The bottom row shows the corresponding difference matrices ($\mathbf{W}_{\mathrm{FP}} - \hat{\mathbf{W}}_{\mathrm{init}}$), illustrating the initial error structure. Our function-aware initialization produces a qualitatively different structure compared to the others, which is reflected in its distinct error pattern.
  • Figure 4: End-to-end decoding throughput (tokens/second) for Llama2-7B on an NVIDIA RTX 4090 across various generated token lengths. RaBiT's parallel architecture consistently delivers superior performance over other 2-bit methods.
  • Figure 5: Convergence analysis of Iterative Residual SVID on Llama2-7B. The metric is the Initial KL Divergence Loss (Lower is better). Convergence stabilizes around 20 iterations.
  • ...and 1 more figure
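
The inter-path correlation mentioned in the Figure 2 caption can be measured along the lines of the sketch below. It is a static toy, not a reproduction of the figure: the random weight and activations, the per-row scale, the helper name `binarize`, and the use of a flattened Pearson coefficient are all assumptions, and no training dynamics are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
w_fp = rng.standard_normal((64, 256))  # toy shared weight: 64 outputs, 256 inputs
x = rng.standard_normal((512, 256))    # toy input activations

def binarize(w):
    """Scaled sign approximation with an assumed per-row scale."""
    sign = np.where(w >= 0, 1.0, -1.0)
    scale = np.abs(w).mean(axis=1, keepdims=True)
    return scale * sign

path1 = binarize(w_fp)          # first binary path
path2 = binarize(w_fp - path1)  # second path, derived from the residual

y1 = x @ path1.T                # output contribution of path 1
y2 = x @ path2.T                # output contribution of path 2

# Pearson correlation between the two paths' outputs, over all samples and units.
corr = np.corrcoef(y1.ravel(), y2.ravel())[0, 1]
print(f"inter-path output correlation: {corr:.3f}")
```

In this toy setting the coefficient comes out negative, because the second path flips the sign wherever the first path overshoots the original weight. How the correlation evolves during training is what the figure itself reports.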

Theorems & Definitions (9)

  • Proposition 1: Inter-Path Adaptation in Standard QAT
  • proof
  • Proposition 2: Structurally Enforced Error Correction in RaBiT
  • proof: Analysis.
  • Proposition 3: Superior Optimization Dynamics of Coupled vs. Iterative Training
  • proof
  • proof: Analysis.
  • Proposition 4: Optimality of Residual Coupling under KL Divergence
  • proof: Analysis.