Dataset Quantization with Active Learning based Adaptive Sampling
Zhenghao Zhao, Yuzhang Shang, Junyi Wu, Yan Yan
TL;DR
This work tackles data efficiency in deep learning by showing that, across dataset compression methods, sensitivity to sample quantity varies significantly from class to class. It introduces Dataset Quantization with Active Learning based Adaptive Sampling (DQAS), which combines class-wise initialization and error-reduction-driven active learning with a quantization pipeline that accounts for patchified images, so that sampling adapts per class and feature representations stay consistent. Empirical results on CIFAR-10, CIFAR-100, and Tiny ImageNet show that DQAS outperforms state-of-the-art dataset compression methods, especially at low sampling ratios. The approach offers a practical framework for imbalanced dataset compression and efficient data management in deep learning pipelines.
Abstract
Deep learning has made remarkable progress recently, largely due to the availability of large, well-labeled datasets. However, training on such datasets raises costs and computational demands. To address this, techniques such as coreset selection, dataset distillation, and dataset quantization have been explored in the literature. Traditional techniques depend on uniform sample distributions across classes; in contrast, we demonstrate that performance can be maintained even with uneven distributions. We find that for certain classes, varying the sample quantity has minimal impact on performance. Inspired by this observation, an intuitive idea is to reduce the number of samples for stable classes and increase the number of samples for sensitive classes, achieving better performance at the same sampling ratio. This raises the question: how can we adaptively select samples from a dataset to achieve optimal performance? In this paper, we propose a novel active learning based adaptive sampling strategy, Dataset Quantization with Active Learning based Adaptive Sampling (DQAS), to optimize the sample selection. In addition, we introduce a novel pipeline for dataset quantization that uses the feature space from the final stage of dataset quantization to generate more precise dataset bins. Comprehensive evaluations on multiple datasets show that our approach outperforms state-of-the-art dataset compression methods.
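
To make the intuition of shifting samples from stable classes to sensitive classes concrete, the following is a minimal sketch of splitting a fixed sampling budget unevenly across classes. It is not the DQAS procedure, which selects samples through active learning; it only illustrates that the overall sampling ratio can stay constant while per-class counts differ. The function name `allocate_class_budgets` and the per-class `sensitivity` scores are hypothetical, and the proportional rule with largest-remainder rounding is an assumption chosen for simplicity.

```python
import numpy as np

def allocate_class_budgets(class_sizes, sensitivity, ratio):
    """Split a fixed sample budget across classes in proportion to a
    hypothetical per-class sensitivity score (higher = more sensitive).

    Illustrative only: this is not the selection rule used by DQAS.
    """
    class_sizes = np.asarray(class_sizes, dtype=int)
    sensitivity = np.asarray(sensitivity, dtype=float)
    if sensitivity.sum() <= 0:
        raise ValueError("sensitivity scores must have a positive sum")

    # Total number of samples allowed by the overall sampling ratio.
    total = int(round(ratio * class_sizes.sum()))

    # Ideal (fractional) share of the budget for each class.
    raw = sensitivity / sensitivity.sum() * total

    # Largest-remainder rounding so the integer budgets sum to `total`.
    budgets = np.floor(raw).astype(int)
    leftover = total - budgets.sum()
    order = np.argsort(-(raw - budgets))
    budgets[order[:leftover]] += 1

    # A class cannot contribute more samples than it has. If this cap is
    # hit, the realized total falls below the target in this simple sketch.
    return np.minimum(budgets, class_sizes)

# Four classes of 5000 images each at a 10% overall sampling ratio.
# A uniform split would give 500 per class.
budgets = allocate_class_budgets([5000] * 4, [1, 1, 2, 4], ratio=0.1)
print(budgets.tolist())  # [250, 250, 500, 1000]
```

In this toy setting the two stable classes give up half of their uniform share, and the most sensitive class receives twice its uniform share, while the total stays at 2000 samples.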
