Tequila: Trapping-free Ternary Quantization for Large Language Models

Hong Huang, Decheng Wu, Rui Cen, Guanghua Yu, Zonghang Li, Kai Liu, Jianchen Zhu, Peng Chen, Xue Liu, Dapeng Wu

TL;DR

This work tackles the challenge of deploying large language models on edge devices by addressing deadzone trapping in aggressive ternary quantization. It introduces Tequila, a trapping-free quantization method that reactivates dead weights as adaptive biases through differentiable reactivation, offline bias precomputation, and hybrid weight-bias roles, preserving hardware efficiency. Empirical results on LLaMA-3.2 models across five benchmarks show Tequila outperforms state-of-the-art ternary methods by substantial margins (e.g., ARC gains above 4%) while achieving near full-precision accuracy and up to a 3x speedup on CPUs. The approach enables practical, efficient on-device LLM deployment with limited training data and minimal inference overhead, offering a scalable path for resource-constrained applications.

Abstract

Quantization techniques are essential for the deployment of Large Language Models (LLMs) on edge devices. However, prevailing methods often rely on mixed-precision multiplication that lacks efficient hardware support, making them impractical. Ternary weight quantization addresses this by constraining weights to {-1, 0, 1}, replacing expensive multiplications with hardware-efficient additions. However, such aggressive compression leads to significant accuracy degradation, even after costly quantization-aware training with massive data. We identify the core issue as deadzone trapping: a large number of weights are trapped at the deadzone boundary. This occurs because these weights receive only noisy, uninformative gradients, which prevents stable escape from the deadzone and severely impedes model capacity and optimization. To address this issue, we propose Tequila, a trapping-free quantization optimization method that reactivates deadzone-trapped weights by repurposing them as dynamic biases. This allows the repurposed weights to provide a continuous signal in the forward pass and, critically, to receive direct, meaningful gradient signals during backpropagation, thereby enhancing model capacity and optimization with nearly zero inference overhead. Extensive evaluations demonstrate that Tequila outperforms state-of-the-art (SOTA) ternary quantization methods across five benchmarks. Specifically, on the ARC benchmark, it achieves a >4% accuracy gain over the SOTA baseline, nearly matching full-precision performance (a gap of <1%) with a 3.0x inference speedup. Consequently, Tequila offers a highly practical and efficient implementation for the deployment of advanced LLMs in resource-constrained environments. The code is available at https://github.com/Tencent/AngelSlim.
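
To make the abstract's description of ternary quantization concrete, the following is a minimal sketch, not the authors' implementation. It assumes a common thresholding heuristic (deadzone half-width equal to 0.7 times the mean absolute weight) and a single per-vector scale; the names `ternarize`, `ternary_dot`, and `threshold_ratio` are hypothetical. It shows how weights inside the deadzone collapse to 0 and how the dot product then needs only additions and subtractions plus one final scaling.

```python
def ternarize(weights, threshold_ratio=0.7):
    """Map each weight to -1, 0, or +1; return (ternary, scale, delta).

    threshold_ratio is an assumed heuristic, not a value from the paper.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_ratio * mean_abs  # deadzone half-width
    ternary = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    # Scale: mean magnitude of the weights that survived outside the deadzone.
    kept = [abs(w) for w, t in zip(weights, ternary) if t != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return ternary, scale, delta


def ternary_dot(x, ternary, scale):
    """Dot product using only additions and subtractions, then one multiply."""
    acc = 0.0
    for xi, t in zip(x, ternary):
        if t == 1:
            acc += xi
        elif t == -1:
            acc -= xi
        # t == 0: the weight is in the deadzone and contributes nothing
    return scale * acc


weights = [0.9, -0.05, 0.02, -0.8, 0.1, 0.7]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

ternary, scale, delta = ternarize(weights)
dead_fraction = ternary.count(0) / len(ternary)

print(ternary)                        # [1, 0, 0, -1, 0, 1]
print(dead_fraction)                  # 0.5
print(ternary_dot(x, ternary, scale))
```

In this toy vector, half of the weights fall into the deadzone and are silent in the forward pass, which is the capacity loss the abstract refers to.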

Paper Structure

This paper contains 36 sections, 14 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: (Top) Deadzone Trapping in Ternary Quantization: Dead weights are trapped in a cycle of ineffective oscillation around the deadzone boundary due to noisy and uninformative gradients. This significantly impedes model capacity and optimization and causes a significant accuracy drop ($>5\%$) versus the full-precision model. (Bottom) Reactivation Strategy of Tequila: Tequila reactivates dead weights as dynamic biases, providing direct and meaningful gradients for stable escapes and enhancing model capability and optimization, leaving only a minor accuracy gap ($<1\%$).
  • Figure 2: (a) Prior Ternary Quantization replaces multiplications with efficient additions but suffers from severe information loss and limited capacity due to deadzone-trapped weights. (b) Minima Reactivation assigns signed minima to dead weights, improving capacity but yielding only marginal accuracy gains. (c) Tequila reactivates dead weights as adaptive dynamic biases via a differentiable function, achieving significant accuracy improvements with nearly zero inference overhead. For simplicity, the scaling operation is omitted from the figure.
  • Figure 3: Evaluation of Tequila on convergence speed compared to SOTA ternary quantization.
  • Figure 4: Inference speed of TequilaLLM versus BF16 LLaMA and ternary BitNet.
  • Figure 5: Ablation study comparing Tequila against its variants.
  • ...and 3 more figures
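
The contrast between panels (a) and (c) of Figure 2 can be illustrated with a toy sketch. This is not the paper's formulation: the differentiable reactivation function is not given in this section, so the sketch assumes the simplest possible stand-in, an input-independent bias equal to a hypothetical coefficient `LAM` times the sum of the weights currently in the deadzone. That choice is in the spirit of the "offline bias precomputation" mentioned in the TL;DR. `DELTA`, `SCALE`, and all function names are likewise illustrative.

```python
DELTA = 0.3   # assumed fixed deadzone half-width
SCALE = 0.8   # assumed fixed scale for the ternary weights
LAM = 0.1     # hypothetical reactivation coefficient


def ternarize(weights, delta=DELTA):
    return [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]


def output_plain(x, weights):
    """Panel (a): dead weights contribute nothing to the output."""
    t = ternarize(weights)
    # t * xi is written as a product for brevity; t is only ever -1, 0, or +1.
    return SCALE * sum(ti * xi for ti, xi in zip(t, x))


def dead_bias(weights, lam=LAM):
    """Assumed stand-in for reactivation: dead weights summed into a bias."""
    t = ternarize(weights)
    return lam * sum(w for w, ti in zip(weights, t) if ti == 0)


def output_reactivated(x, weights):
    """Toy version of panel (c): ternary output plus the dead-weight bias."""
    return output_plain(x, weights) + dead_bias(weights)


weights = [0.9, -0.05, 0.02, -0.8, 0.1, 0.7]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Nudge one dead weight (index 4) without pushing it out of the deadzone.
eps = 1e-3
nudged = weights[:]
nudged[4] += eps

plain_change = output_plain(x, nudged) - output_plain(x, weights)
reactivated_change = output_reactivated(x, nudged) - output_reactivated(x, weights)

print(plain_change)        # 0.0: the output is insensitive to the dead weight
print(reactivated_change)  # roughly LAM * eps
```

Under these assumptions, the plain ternary output does not respond to a small change in a dead weight, while the reactivated output responds in proportion to `LAM`, giving that weight a direct path to the loss. Because the bias does not depend on `x`, it could be computed once after training rather than at inference time.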