Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models

Wanyun Cui, Qianle Wang

TL;DR

This work reveals pervasive parameter heterogeneity in large language models: a tiny subset of "cherry" parameters disproportionately affects performance, while most parameters tolerate quantization with little loss. It introduces CherryQ, a quantization framework that optimizes mixed-precision parameters end to end. CherryQ uses a heterogeneity-based criterion to identify cherry parameters, keeps them in high precision, and quantizes the rest. In experiments on base and chat LLMs, CherryQ achieves better perplexity and downstream task performance than existing quantization approaches at 3-bit precision, and its 3-bit Vicuna-1.5 is competitive with the 16-bit model. The approach reduces memory and compute requirements for deployment while maintaining model quality, with the largest gains in ultra-low-precision regimes and consistent results across datasets and tasks.

Abstract

This paper reveals the phenomenon of parameter heterogeneity in large language models (LLMs). We find that a small subset of "cherry" parameters exhibits a disproportionately large influence on model performance, while the vast majority of parameters have minimal impact. This heterogeneity is prevalent across different model families, scales, and types. Motivated by this observation, we propose CherryQ, a novel quantization method that unifies the optimization of mixed-precision parameters. CherryQ identifies and preserves the critical cherry parameters in high precision while aggressively quantizing the remaining parameters to low precision. Extensive experiments demonstrate that CherryQ outperforms existing quantization approaches in terms of perplexity and downstream task performance. Notably, our 3-bit quantized Vicuna-1.5 exhibits competitive performance compared to its 16-bit counterpart.
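
The split described in the abstract (a few parameters kept in high precision, the rest quantized) can be illustrated with a small sketch. This is not the authors' implementation: CherryQ optimizes the mixed-precision parameters during training, whereas the code below only applies a one-shot uniform rounding to show the storage layout. The function names, the `cherry_fraction` value, and the synthetic `impact` scores are all assumptions made for illustration; the abstract does not specify the criterion used to score parameters.

```python
import random


def quantize_uniform(values, bits):
    """Round each value to the nearest of 2**bits evenly spaced levels in [min, max]."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    scale = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]


def mixed_precision_quantize(weights, impact, cherry_fraction=0.01, bits=3):
    """Keep the highest-impact weights unchanged and quantize the rest to `bits` bits.

    `impact` is a caller-supplied score per weight (hypothetical here).
    Returns the mixed-precision weights and the set of preserved indices.
    """
    n_cherry = max(1, int(len(weights) * cherry_fraction))
    # Rank indices by descending impact and keep the top ones as "cherries".
    order = sorted(range(len(weights)), key=lambda i: impact[i], reverse=True)
    cherry = set(order[:n_cherry])
    normal_idx = [i for i in range(len(weights)) if i not in cherry]
    quantized = quantize_uniform([weights[i] for i in normal_idx], bits)
    out = list(weights)
    for i, q in zip(normal_idx, quantized):
        out[i] = q
    return out, cherry


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]
# Synthetic, heavily skewed scores standing in for a real impact measure.
impact = [random.random() ** 8 for _ in range(1000)]
mixed, cherry = mixed_precision_quantize(weights, impact)
low_precision_levels = {mixed[i] for i in range(len(mixed)) if i not in cherry}
print(len(cherry), "weights kept in full precision")
print(len(low_precision_levels), "distinct low-precision levels")
```

With 3 bits, the non-cherry weights can take at most 8 distinct values, while the preserved weights are stored unchanged.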

Paper Structure (19 sections, 6 equations, 3 figures, 8 tables, 1 algorithm)

Figures (3)

  • Figure 1: Scatter plot of parameter impacts in different LLMs. We randomly sampled 4096 parameters from the corresponding parameter matrix. Each point represents the impact of an individual parameter. Insets show the zoomed-in y-axis. The heterogeneity is found across different model scales (LLaMA 7B and 13B panels), different model families (Mistral 7B and Gemma 7B panels), and both base models and chat models (Vicuna 7B and 13B panels).
  • Figure 2: Scatter distribution of heterogeneity scores for different parameter matrices in LLMs. Each point represents a parameter matrix.
  • Figure 3: Comparison of 3-bit quantized models to FP16 Vicuna-1.5. (Left) Comparisons to Vicuna-1.5-7B. (Right) Comparisons to Vicuna-1.5-13B. CherryQ even shows competitive quality compared to the 16-bit counterpart.