TernaryLLM: Ternarized Large Language Model

Tianqi Chen; Zhe Li; Weixiang Xu; Zeyu Zhu; Dong Li; Lu Tian; Emad Barsoum; Peisong Wang; Jian Cheng

TL;DR

This work introduces Dual Learnable Ternarization (DLT), which makes both scales and shifts learnable, and proposes Outlier-Friendly Feature Knowledge Distillation (OFF) to recover the information lost in extremely low-bit quantization.

Abstract

Large language models (LLMs) have achieved remarkable performance on Natural Language Processing (NLP) tasks, but they are hindered by high computational costs and memory requirements. Ternarization, an extreme form of quantization, offers a solution by reducing memory usage and enabling energy-efficient floating-point additions. However, applying ternarization to LLMs faces challenges stemming from outliers in both weights and activations. In this work, observing asymmetric outliers and non-zero means in weights, we introduce Dual Learnable Ternarization (DLT), which enables both scales and shifts to be learnable. We also propose Outlier-Friendly Feature Knowledge Distillation (OFF) to recover the information lost in extremely low-bit quantization. The proposed OFF can incorporate semantic information and is insensitive to outliers. At the core of OFF is maximizing the mutual information between features in ternarized and floating-point models using cosine similarity. Extensive experiments demonstrate that our TernaryLLM surpasses previous low-bit quantization methods on the standard text generation and zero-shot benchmarks for different LLM families. Specifically, for one of the most powerful open-source models, LLaMA-3, our approach (W1.58A16) outperforms the previous state-of-the-art method (W2A16) by 5.8 in terms of perplexity on C4 and by 8.2% in terms of average accuracy on zero-shot tasks.
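
As a rough illustration of why a shift helps when a weight group has a non-zero mean, the sketch below ternarizes one toy group with and without a shift. It is not the authors' implementation: in DLT the scale and shift are trained, whereas here closed-form statistics (the group mean for the shift, the mean magnitude of the non-zero entries for the scale) stand in for them. The 0.7 x mean |w| threshold, the function names, and the toy weights are all assumptions made for this example.

```python
import math

def ternarize(values, threshold_ratio=0.7):
    # Map each value to -1, 0, or +1. The threshold heuristic is an assumption.
    delta = threshold_ratio * sum(abs(v) for v in values) / len(values)
    return [0 if abs(v) <= delta else (1 if v > 0 else -1) for v in values]

def fit_group(weights, use_shift):
    # Closed-form stand-ins for the parameters that DLT would learn by training.
    shift = sum(weights) / len(weights) if use_shift else 0.0
    centered = [w - shift for w in weights]
    codes = ternarize(centered)
    nonzero = [abs(c) for c, t in zip(centered, codes) if t != 0]
    scale = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return codes, scale, shift

def reconstruct(codes, scale, shift):
    # Dequantize: scale * ternary code + shift.
    return [scale * t + shift for t in codes]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy weight group with a non-zero mean (hypothetical values).
group = [0.0, 1.0, 2.0, 1.0, 0.0, 2.0]

symmetric = fit_group(group, use_shift=False)  # scale only
dual = fit_group(group, use_shift=True)        # scale and shift

mse_symmetric = mse(group, reconstruct(*symmetric))
mse_dual = mse(group, reconstruct(*dual))

# Information content of a three-valued code, in bits per weight.
bits_per_weight = math.log2(3)
```

On this toy group the shifted variant reconstructs the weights with lower error than the scale-only variant. The last line reflects that a three-valued code carries log2(3), roughly 1.58 bits, which is where the W1.58 label comes from.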

Paper Structure (19 sections, 3 theorems, 18 equations, 5 figures, 5 tables)

This paper contains 19 sections, 3 theorems, 18 equations, 5 figures, 5 tables.

Introduction
Related Work
LLM Quantization
Knowledge Distillation
Background
Challenges of Ternarizing LLMs
Method
Dual Learnable Ternarization
Outlier-Friendly Feature Knowledge Distillation
Experiments
Experiment Setup
Results on Language Generation
Results on Zero-Shot Tasks
Ablations
Conclusion
...and 4 more sections

Key Result

Theorem 1

Assume $x \in \mathbb{R}^{C_i} \sim \mathcal{N}(0, \sigma_x^2)$, $W = (w_1^T, w_2^T, \ldots, w_{C_o}^T)^T$, and $y = \text{RMSNorm}(Wx)$. Let $W_q$ denote the ternarization of $W$ and $y_q = \text{RMSNorm}(W_q x)$. The objective of maximizing the mutual information $I(y, y_q)$ between $y$ and $y_q$ can ...

Figures (5)

Figure 1: An example of the features in the 23rd decoder layer, illustrating the problems caused by extreme low-bit quantization. The first and second rows correspond to the floating-point and quantized models, respectively. Extreme low-bit quantization leads to severe information loss in pretrained LLMs, including a narrowed feature representation range (panel (a)), loss of prominence in dominant channels (panel (b)), and disruption of the semantic clustering of related words (panels (c) and (d)).
Figure 2: The weights in certain groups display noticeable asymmetric outliers and a non-zero mean distribution.
Figure 3: Feature knowledge distillation results for LLaMA-1-7B. Cosine similarity is less sensitive to outliers in features compared to MSE. (a) Ground truth loss of the training. (b) Feature knowledge distillation loss of the training. (c) The reasons for severe oscillations in MSE distillation.
Figure 4: Training loss and validation perplexity curves. The experiments are conducted on OPT-125M with a group size of 128. Our method surpasses OmniQuant with only 500 steps.
Figure 5: Comparison of different knowledge distillation techniques on the LLaMA-1-7B model. OFF and logits KD, either separately or combined, can improve performance.
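
The contrast drawn in the Figure 3 caption between cosine similarity and MSE can be made concrete with a toy feature vector that contains one outlier channel. This is a hedged sketch, not the paper's OFF loss: the vectors, the function names, and the choice of 1 - cos as the loss are assumptions for illustration.

```python
import math

def mse_loss(teacher, student):
    # Mean squared error between two feature vectors.
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)

def cosine_loss(teacher, student):
    # 1 - cosine similarity; always lies in [0, 2] for non-zero vectors.
    dot = sum(t * s for t, s in zip(teacher, student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    return 1.0 - dot / (norm_t * norm_s)

# Hypothetical features: three ordinary channels and one outlier channel.
teacher = [1.0, -1.0, 1.0, 50.0]
student = [1.0, -1.0, 1.0, 40.0]

# Same pattern with the outlier channel ten times larger.
teacher_big = [1.0, -1.0, 1.0, 500.0]
student_big = [1.0, -1.0, 1.0, 400.0]

mse_small = mse_loss(teacher, student)          # (50 - 40)^2 / 4
mse_big = mse_loss(teacher_big, student_big)    # (500 - 400)^2 / 4
cos_small = cosine_loss(teacher, student)
cos_big = cosine_loss(teacher_big, student_big)
```

In this example the MSE grows with the square of the outlier magnitude (from 25 to 2500), so a single large channel can dominate the loss and its gradients. The cosine loss depends only on direction and stays bounded regardless of feature scale.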

Theorems & Definitions (5)

Theorem 1
Theorem 1
proof
Theorem 2
proof
