Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning

Abhinav Bandari, Lu Yin, Cheng-Yu Hsieh, Ajay Kumar Jaiswal, Tianlong Chen, Li Shen, Ranjay Krishna, Shiwei Liu

TL;DR

This work questions the prevailing assumption that C4 is the optimal calibration data for pruning large language models. It systematically evaluates calibration data drawn from four pre-training sources (C4, Pile, OSCAR, RedPajama) and three downstream formats (Zero-shot, In-Context Learning, In-Context Learning with Chain-of-Thought) across nine downstream tasks, using Wanda and SparseGPT on Llama models. The results show that C4 is not universally best and that Pile often yields better pruning performance. Downstream calibration data, particularly arithmetic-focused datasets, can match or surpass pre-training data; ICL broadly benefits pruning, while the benefit of CoT is task-dependent. The findings offer practical guidance for selecting calibration data to obtain more efficient and effective sparse LLMs, and the authors release code for reproducibility. Overall, the choice of calibration data materially shapes pruning outcomes and should be tailored to the deployment context.

Abstract

Network pruning has emerged as a potential solution to make LLMs cheaper to deploy. However, existing LLM pruning approaches universally rely on the C4 dataset as the calibration data for calculating pruning scores, leaving its optimality unexplored. In this study, we evaluate the choice of calibration data on LLM pruning, across a wide range of datasets that are most commonly used in LLM training and evaluation, including four pre-training datasets as well as three categories of downstream tasks encompassing nine datasets. Each downstream dataset is prompted with In-Context Learning (ICL) and Chain-of-Thought (CoT), respectively. Besides the already intriguing observation that the choice of calibration data significantly impacts the performance of pruned LLMs, our results also uncover several subtle and often unexpected findings, summarized as follows: (1) C4 is not the optimal choice for LLM pruning, even among commonly used pre-training datasets; (2) arithmetic datasets, when used as calibration data, perform on par with or even better than pre-training datasets; (3) pruning with downstream datasets does not necessarily help the corresponding downstream task, compared to pre-training data; (4) ICL is widely beneficial to all data categories, whereas CoT is only useful on certain tasks. Our findings shed light on the importance of carefully selecting calibration data for LLM pruning and pave the way for more efficient deployment of these powerful models in real-world applications. We release our code at: https://github.com/abx393/llm-pruning-calibration-data.
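
To make concrete why calibration data can change which weights are pruned, the following is a minimal sketch of a Wanda-style pruning score, in which each weight's magnitude is multiplied by the L2 norm of its input feature measured over the calibration tokens. It is an illustration under stated assumptions, not the authors' released code: the function names, the toy weight row, and the two toy calibration sets are all hypothetical.

```python
import math


def wanda_scores(weight, calib_inputs):
    """Wanda-style score: |W[i][j]| times the L2 norm of input feature j,
    where the norm is taken over all calibration tokens (rows of calib_inputs)."""
    n_in = len(weight[0])
    norms = [math.sqrt(sum(x[j] ** 2 for x in calib_inputs)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in weight]


def prune_rows(weight, scores, sparsity):
    """Zero out the lowest-scoring fraction of weights within each output row."""
    pruned = []
    for row, score_row in zip(weight, scores):
        k = int(len(row) * sparsity)
        drop = set(sorted(range(len(row)), key=lambda j: score_row[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Hypothetical toy layer: one output neuron with four input features.
weight = [[0.5, -0.4, 0.3, 0.2]]

# Two hypothetical calibration sets that activate different input features.
calib_a = [[1.0, 0.1, 0.1, 2.0], [1.0, 0.1, 0.1, 2.0]]
calib_b = [[0.1, 2.0, 2.0, 0.1], [0.1, 2.0, 2.0, 0.1]]

pruned_a = prune_rows(weight, wanda_scores(weight, calib_a), sparsity=0.5)
pruned_b = prune_rows(weight, wanda_scores(weight, calib_b), sparsity=0.5)

print(pruned_a)
print(pruned_b)
```

With the same weights and the same 50% sparsity, the two calibration sets keep disjoint pairs of weights, because the activation norms differ. This is the mechanism through which the choice of calibration data can influence the pruned model.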

Paper Structure

This paper contains 15 sections, 1 figure, 9 tables.

Figures (1)

  • Figure 1: Examples of various calibration data formats examined in this paper.