Adaptive Dataset Quantization: A New Direction for Dataset Pruning
Chenyue Yu, Jianyu Yu
TL;DR
This work introduces Adaptive Dataset Quantization (ADQ), a preprocessing method that compresses datasets by reducing intra-sample redundancy while preserving training performance. It combines per-sample linear symmetric quantization with an adaptive bit-width allocation that assigns higher precision to more quantization-sensitive samples while keeping the overall compression ratio fixed. In experiments on CIFAR-10/100 and ImageNet-1K, ADQ outperforms traditional dataset pruning and distillation baselines at equivalent compression, and even yields slight gains over full-precision training at moderate compression. The approach improves data efficiency for edge deployments and opens avenues for integrating dataset and model quantization in a unified efficiency framework.
Abstract
This paper addresses the storage and communication costs of large-scale datasets on resource-constrained edge devices by proposing a novel dataset quantization approach that reduces intra-sample redundancy. Unlike traditional dataset pruning and distillation methods, which target inter-sample redundancy, the proposed method compresses each image by removing redundant or less informative content within the sample while preserving its essential features. It first applies linear symmetric quantization to obtain an initial quantization range and scale for each sample. An adaptive quantization allocation algorithm then assigns different quantization ratios to samples with different precision requirements while keeping the total compression ratio constant. The main contributions are: (1) the first use of a limited number of bits to represent datasets for storage reduction; (2) a dataset-level quantization algorithm with adaptive ratio allocation; and (3) validation of the method through extensive experiments on CIFAR-10, CIFAR-100, and ImageNet-1K. Results show that the method maintains model training performance while achieving significant dataset compression, outperforming traditional quantization and dataset pruning baselines at the same compression ratios.
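The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how sensitivity is measured or how the allocation is solved, so the mean squared reconstruction error, the greedy one-bit-at-a-time loop, and all function and parameter names (`symmetric_quantize`, `allocate_bits`, `avg_bits`, `min_bits`, `max_bits`) are assumptions made for illustration. Samples are assumed to be normalized, roughly zero-centered float arrays.

```python
import numpy as np

def symmetric_quantize(x, bits):
    """Linear symmetric quantization of one sample onto a signed `bits`-bit grid."""
    if bits < 2:
        raise ValueError("bits must be >= 2 for a signed symmetric grid")
    x = np.asarray(x, dtype=np.float32)
    qmax = 2 ** (bits - 1) - 1                       # e.g. 127 for 8 bits
    max_abs = float(np.max(np.abs(x)))               # per-sample range
    scale = max_abs / qmax if max_abs > 0 else 1.0   # per-sample scale
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate float values."""
    return q.astype(np.float32) * scale

def quantization_error(x, bits):
    """Assumed sensitivity proxy: mean squared reconstruction error."""
    q, scale = symmetric_quantize(x, bits)
    diff = np.asarray(x, dtype=np.float32) - dequantize(q, scale)
    return float(np.mean(diff ** 2))

def allocate_bits(samples, avg_bits, min_bits=2, max_bits=8):
    """Hypothetical greedy allocation under a fixed total bit budget.

    Every sample starts at `min_bits`; the remaining budget is handed out
    one bit at a time to the sample with the largest current error.
    """
    n = len(samples)
    budget = int(round(avg_bits * n)) - min_bits * n
    if budget < 0:
        raise ValueError("avg_bits must be at least min_bits")
    bits = [min_bits] * n
    errors = [quantization_error(s, b) for s, b in zip(samples, bits)]
    for _ in range(budget):
        candidates = [i for i in range(n) if bits[i] < max_bits]
        if not candidates:
            break
        i = max(candidates, key=lambda j: errors[j])
        bits[i] += 1
        errors[i] = quantization_error(samples[i], bits[i])
    return bits

# Usage: eight synthetic "samples" with increasing dynamic range.
rng = np.random.default_rng(0)
samples = [rng.standard_normal(64).astype(np.float32) * (i + 1) for i in range(8)]
bit_widths = allocate_bits(samples, avg_bits=4)
print(bit_widths, "average:", sum(bit_widths) / len(bit_widths))
```

The average bit-width stays at the requested value as long as the `max_bits` cap is not reached for every sample, which mirrors the constant total compression ratio described in the abstract; any other sensitivity measure could be substituted for `quantization_error`.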
