Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization

Bowei He, Lihao Yin, Huiling Zhen, Shuqi Liu, Han Wu, Xiaokun Zhang, Mingxuan Yuan, Chen Ma

TL;DR

The paper examines how calibration data affects the preservation of LLM capabilities under post-training compression and introduces COLA, a three-stage calibration data curation framework that maximizes representativeness and diversity in activation space. By systematically varying calibration data properties and analyzing activation patterns, it shows that domain correspondence and compositional properties of the calibration data strongly influence high-level capabilities such as math problem solving and code generation. COLA formalizes dataset selection, dataset processing, and sample selection (via activation-space clustering) to improve capability preservation across pruning and quantization methods, with empirical gains on LLaMA3-8B, Qwen2.5-7B, and larger models. The work gives practical compression pipelines a principled, activation-focused approach to calibration data, validated through extensive experiments and spectral analyses that link calibration data quality to the capabilities preserved after compression.

Abstract

Post-training compression is a widely used approach to scaling down large language models (LLMs) and enabling efficient inference. In many compression methods, including pruning and quantization, calibration data plays a vital role by informing weight importance and activation dynamic ranges. However, how calibration data affects LLM capabilities after compression remains underexplored. The few existing works that recognize the significance of this question investigate only the degradation of language modeling or commonsense reasoning performance, and only from limited angles such as data sources or sample amounts. More systematic research is needed to examine the impact on different LLM capabilities in terms of the compositional properties and domain correspondence of calibration data. In this work, we aim to bridge this gap and further analyze the underlying mechanisms from the perspective of activation patterns. In particular, we explore the impact of calibration data on high-level complex reasoning capabilities, such as math problem solving and code generation. Delving into the underlying mechanism, we find that representativeness and diversity in activation space more fundamentally determine the quality of calibration data. Finally, we propose a calibration data curation framework based on these observations and analysis, enhancing the performance of existing post-training compression methods in preserving critical LLM capabilities. Our code is available at https://github.com/BokwaiHo/COLA.git.
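
To make the idea of representativeness and diversity in activation space more concrete, the following is a minimal sketch of clustering-based sample selection. It is not the authors' COLA implementation: the function name `select_calibration_samples`, the use of plain k-means with farthest-point initialization, and the assumption that each candidate sample has already been reduced to one activation feature vector are all illustrative choices made here.

```python
import numpy as np


def select_calibration_samples(activations, k, n_iter=20):
    """Pick up to k sample indices by k-means in activation space.

    Hypothetical helper for illustration only. `activations` is assumed to be
    an (n_samples, n_features) array holding one pooled activation vector per
    candidate calibration sample.
    """
    x = np.asarray(activations, dtype=np.float64)
    n = x.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of samples")

    # Farthest-point initialization: start from the sample nearest the global
    # mean, then repeatedly add the sample farthest from all chosen centroids.
    first = int(np.argmin(np.linalg.norm(x - x.mean(axis=0), axis=1)))
    chosen = [first]
    min_dist = np.linalg.norm(x - x[first], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(x - x[nxt], axis=1))
    centroids = x[chosen].copy()

    # Lloyd iterations: assign each sample to its nearest centroid, then move
    # each centroid to the mean of its members (empty clusters stay in place).
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)

    # One representative per cluster: the sample closest to each centroid.
    # Spreading picks across clusters is what provides diversity here.
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    selected = dists.argmin(axis=0)
    return sorted({int(i) for i in selected})
```

Choosing the sample nearest each centroid favors representativeness within a cluster, while taking one sample per cluster favors diversity across clusters. If several centroids share the same nearest sample, the function returns fewer than `k` indices.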

Paper Structure

This paper contains 44 sections, 5 equations, 15 figures, 17 tables.

Figures (15)

  • Figure 1: Calibration data in LLM compression.
  • Figure 2: Impact of calibration data sequence length on capability preservation across different compression methods for LLaMA3-8B and Qwen2.5-7B. Note the varying sensitivity to sequence length across capabilities and compression methods.
  • Figure 3: Effect of calibration sample amount on capability preservation for LLaMA3-8B and Qwen2.5-7B models. Note the diminishing returns beyond 64-128 samples for most capabilities, with AWQ showing particular robustness to small sample sizes.
  • Figure 4: Impact of calibration data sources on capability preservation for LLaMA3-8B and Qwen2.5-7B. Note the significant advantage of C4 for code generation tasks, while Wikipedia provides more balanced performance across capabilities.
  • Figure 5: Impact of calibration data format on compressed LLM capability preservation.
  • ...and 10 more figures