Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization

Björn Deiseroth, Max Meuer, Nikolas Gritsch, Constantin Eichenberg, Patrick Schramowski, Matthias Aßenmacher, Kristian Kersting

TL;DR

The paper addresses the challenge of evaluating compressed large language models by introducing Divergent Token Metrics (DTMs), including First Divergent Token Metric (FDTM) and Share of Divergent Tokens Metric (SDTM). These token-centric metrics align with the actual greedy sampling process and provide advantages over perplexity, enabling principled, component-wise sparsification and quantization on Llama-2 models. Empirical results show that 25% of attention components can be pruned beyond 90% sparsity and that around 80% of parameters can be naively quantized to int8 without severe degradation, underscoring the potential of DTMs to guide efficient compression. By directly measuring generation divergence, the approach improves upon traditional NLP benchmarks and supports targeted compression strategies with practical impact for deploying compact LLMs.
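As a rough illustration of the two metrics named above, the following Python sketch compares a baseline generation with a compressed model's generation token by token. The function names, the convention of returning the sequence length when no divergence occurs, and the example token ids are assumptions made for illustration; the paper's formal definition (Definition 3.1) is not reproduced on this page and may differ in detail.

```python
from typing import Sequence


def first_divergent_token(baseline: Sequence[int], compressed: Sequence[int]) -> int:
    """Return the index of the first position where the two sequences differ.

    Assumption: if the compared positions all agree, the number of compared
    positions is returned, so a larger value means a later divergence.
    """
    n = min(len(baseline), len(compressed))
    for i in range(n):
        if baseline[i] != compressed[i]:
            return i
    return n


def share_of_divergent_tokens(baseline: Sequence[int], compressed: Sequence[int]) -> float:
    """Return the fraction of compared positions whose tokens differ."""
    n = min(len(baseline), len(compressed))
    if n == 0:
        return 0.0
    differing = sum(1 for b, c in zip(baseline, compressed) if b != c)
    return differing / n


# Hypothetical token ids for 8 generated tokens from each model.
baseline_tokens = [5, 9, 2, 7, 7, 1, 3, 8]
compressed_tokens = [5, 9, 2, 4, 7, 1, 6, 8]

fdt = first_divergent_token(baseline_tokens, compressed_tokens)
sdt = share_of_divergent_tokens(baseline_tokens, compressed_tokens)
```

In this toy example the sequences first differ at index 3, and 2 of the 8 positions differ, so the share is 0.25.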

Abstract

Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. However, their ever-increasing size has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach to assessing compressed LLMs, addressing the limitations of traditional perplexity or accuracy measures that fail to accurately reflect text generation quality. DTMs measure token divergences that allow deeper insights into the subtleties of model compression, in particular, when evaluating components' impacts individually. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that 25% of all attention components can be pruned beyond 90% on the Llama-2 model family, still keeping SOTA performance. For quantization, FDTM suggests that more than 80% of parameters can be naively transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually -- and that FDTM can identify those -- while standard metrics result in deteriorated outcomes.
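To make "naive" int8 conversion and the role of outliers more concrete, here is a minimal Python sketch of symmetric absmax quantization on a flat list of weights. This particular scheme, the helper names, and the example values are assumptions chosen for illustration; the abstract does not specify which int8 conversion the authors use.

```python
from typing import List, Tuple


def quantize_absmax_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    largest = max((abs(w) for w in weights), default=0.0)
    if largest == 0.0:
        return [0] * len(weights), 1.0
    scale = largest / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


def max_abs_error(weights: List[float]) -> float:
    """Largest round-trip error after quantizing and dequantizing."""
    quantized, scale = quantize_absmax_int8(weights)
    restored = dequantize(quantized, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical weights: small values only, then the same values plus one outlier.
plain_weights = [0.02, -0.01, 0.03, -0.025]
outlier_weights = plain_weights + [8.0]

err_plain = max_abs_error(plain_weights)
err_outlier = max_abs_error(outlier_weights)
```

With a single scale per tensor, one large outlier stretches the quantization grid so that the small weights collapse to zero. This is one plausible reason why some components tolerate naive conversion while others would need special handling.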

Paper Structure

This paper contains 22 sections, 2 theorems, 13 equations, 22 figures, 2 tables, 1 algorithm.

Key Result

Proposition 3.2

Given any $y$, $N$ and $\varepsilon > 0$ there exist logits $l, l' \in {\mathbb R}^{N \times |\mathcal{V}|}$ such that

Figures (22)

  • Figure 1: Illustration of a diverging generation process. Given the 3-token prefix as prompt, a baseline and its compressed model generate 8 subsequent tokens. Our proposed metric points to the first divergent token (FDT). The FDT may cause further divergence during the iterative generation process. Note how both models score the same perplexity value, as it does not reflect the actual sampling process (cf. Fig. \ref{fig:sparse_random}, Sec. \ref{sec:compression} for an empirical exploration).
  • Figure 2: Pruning the lowest weights versus random weights. FDT is able to discriminate between the two cases, whereas PPL performs exactly at the level of guessing. Cf. Sec. \ref{sec:sparse}.
  • Figure 3: Hyperparameter selection of FDT. Visualized is the standard deviation (std) in FDT$_{75}$ over all components when varying prefix length (y-axis) and applying different choices for sparsity-step increases (x-axis), cf. Sec. \ref{sec:exp} and \ref{sec:sparse}.
  • Figure 4: Depiction of the proposed sparsification process that converged to a 75% sparse Llama-2-13B. a) Model training performance throughout all rounds. Our FDT-based sparsification clearly outperforms uniform magnitude pruning. b) Converged sparsity values per component. One quarter of attention components are pruned beyond 90% sparsity. Significant outliers appear in first and last layers.
  • Figure 5: Evaluation of the Tree Search as described in the text. a) Comparison of Tree Search based componentwise quantization. Different numbers of components (x-axis) lead to different token divergence scores (y-axis, normalized to $[0,1]$), which, early on in particular, correlate with the introduced outliers (second y-axis). Throughout the entire search, FDT is able to rank components by their potential errors and, coincidentally, outliers. b) Selected components at respective depth. A.Key and A.Value induce the most error.
  • ...and 17 more figures
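The observation in the Figure 1 caption, that two models can score the same perplexity while generating different tokens, can be reproduced with a small Python sketch. The three-token vocabulary, the probability values, and the helper names are hypothetical and chosen only so that the reference token receives the same probability under both models while the greedy choice differs at the second step.

```python
import math
from typing import List


def perplexity(step_probs: List[List[float]], reference: List[int]) -> float:
    """Perplexity of the reference tokens under per-step distributions."""
    log_likelihood = sum(math.log(p[t]) for p, t in zip(step_probs, reference))
    return math.exp(-log_likelihood / len(reference))


def greedy_tokens(step_probs: List[List[float]]) -> List[int]:
    """Token with the highest probability at each step."""
    return [max(range(len(p)), key=lambda i: p[i]) for p in step_probs]


# Hypothetical per-step distributions over a vocabulary of size 3.
reference = [0, 0]
baseline_probs = [[0.7, 0.2, 0.1], [0.4, 0.3, 0.3]]
compressed_probs = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]]

ppl_baseline = perplexity(baseline_probs, reference)
ppl_compressed = perplexity(compressed_probs, reference)

greedy_baseline = greedy_tokens(baseline_probs)
greedy_compressed = greedy_tokens(compressed_probs)

# Index of the first step where the greedy choices differ, or None if they never do.
first_divergence = next(
    (i for i, (b, c) in enumerate(zip(greedy_baseline, greedy_compressed)) if b != c),
    None,
)
```

Both models assign probability 0.4 to the reference token at the second step, so their perplexities coincide. The compressed model's most likely token there is a different one, however, so greedy decoding diverges at index 1.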

Theorems & Definitions (7)

  • Definition 3.1
  • Proposition 3.2
  • proof
  • Proposition 3.3
  • proof
  • proof : Proof of Proposition \ref{prop:ppl_discontinuity}
  • proof : Proof of Proposition \ref{prop:upper_bound}