Accuracy is Not All You Need

Abhinav Dutta, Sanjeev Krishnan, Nipun Kwatra, Ramachandran Ramjee

TL;DR

This work argues that accuracy alone is insufficient for judging compressed LLMs. It identifies the phenomenon of flips, in which many individual answers change between correct and incorrect even though aggregate accuracy stays similar. Through experiments across multiple quantization schemes and other compression techniques, the authors show that behavior visible to end users can diverge significantly from the baseline even when accuracy is nearly unchanged, and that KL-divergence correlates with flips. MT-Bench results and qualitative analyses further demonstrate degradation in free-form generation that accuracy metrics miss, especially for larger models. The paper advocates adding distance metrics such as flips and KL-divergence to standard evaluation pipelines, and it outlines practical implications for deploying compressed LLMs in real-world settings.

Abstract

When Large Language Models (LLMs) are compressed using techniques such as quantization, the predominant way to demonstrate the validity of such techniques is by measuring the model's accuracy on various benchmarks. If the accuracies of the baseline model and the compressed model are close, it is assumed that there was negligible degradation in quality. However, even when the accuracy of baseline and compressed model are similar, we observe the phenomenon of flips, wherein answers change from correct to incorrect and vice versa in proportion. We conduct a detailed study of metrics across multiple compression techniques, models and datasets, demonstrating that the behavior of compressed models as visible to end-users is often significantly different from the baseline model, even when accuracy is similar. We further evaluate compressed models qualitatively and quantitatively using MT-Bench and show that compressed models are significantly worse than baseline models in this free-form generative task. Thus, we argue that compression techniques should also be evaluated using distance metrics. We propose two such metrics, KL-Divergence and flips, and show that they are well correlated.
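
The two distance metrics named in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that flips are counted as the fraction of questions whose correctness differs between the baseline and the compressed model, and that KL-divergence is computed between the two models' probability distributions over the answer options of a single question. The function names and the toy data are hypothetical.

```python
import math

def accuracy(preds, labels):
    """Fraction of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def flips(base_preds, comp_preds, labels):
    """Fraction of questions whose correctness changes between the models.

    Assumption: a flip is correct -> incorrect or incorrect -> correct.
    A change from one wrong answer to another wrong answer is not counted.
    """
    changed = sum(
        (b == y) != (c == y)
        for b, c, y in zip(base_preds, comp_preds, labels)
    )
    return changed / len(labels)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same options.

    eps guards against division by zero and log(0) (illustrative choice).
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Toy data: both models score 3 out of 4, yet two answers flip.
labels = ["A", "B", "C", "D"]
base_preds = ["A", "B", "A", "D"]   # wrong on question 3
comp_preds = ["A", "C", "C", "D"]   # wrong on question 2

acc_base = accuracy(base_preds, labels)
acc_comp = accuracy(comp_preds, labels)
flip_rate = flips(base_preds, comp_preds, labels)

# Hypothetical answer-option probabilities for one question.
base_probs = [0.70, 0.10, 0.10, 0.10]
comp_probs = [0.40, 0.35, 0.15, 0.10]
kl = kl_divergence(base_probs, comp_probs)

print(f"accuracy: baseline={acc_base:.2f}, compressed={acc_comp:.2f}")
print(f"flips: {flip_rate:.2f}")
print(f"KL(baseline || compressed): {kl:.4f}")
```

In this toy case the accuracy difference is zero while half of the answers change correctness, which is the situation the abstract describes: one correct-to-incorrect flip and one incorrect-to-correct flip cancel out in the aggregate.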

Paper Structure (19 sections, 10 figures, 19 tables)

Figures (10)

  • Figure 1: All six quantization schemes show a negligible difference in accuracy compared to the baseline 16-bit model across seven different tasks. However, all schemes except GPTQ W8A16 (8-bit weight, 16-bit activation) exhibit a large number of flips, indicating severe divergence in model behavior.
  • Figure 2: MMLU 5-shot accuracy difference and flips for two compression techniques (Llama2-13b model). Even at early stages of pruning with no accuracy difference, flips indicate model divergence.
  • Figure 3: When the Top Margin is low, the answer is more likely to change (Llama2-70b, BnB W4A4, MMLU 5-shot).
  • Figure 4: When the Top Margin is low, the answer is more likely to be incorrect (Llama2-70b, MMLU 5-shot).
  • Figure 5: Flips and KL Divergence are well correlated. Each point corresponds to a (model, quantization) combination in Table \ref{tab:mmlu5shot}.
  • ...and 5 more figures