Dataset Quantization with Active Learning based Adaptive Sampling

Zhenghao Zhao, Yuzhang Shang, Junyi Wu, Yan Yan

TL;DR

This work tackles data efficiency in deep learning by showing that class-wise sensitivity to sample quantity varies significantly across dataset compression methods. It introduces Dataset Quantization with Active Learning based Adaptive Sampling (DQAS), combining class-wise initialization and error-reduction-driven active learning with a patchified-image-aware quantization pipeline to adapt sampling and maintain consistent feature representations. Empirical results on CIFAR-10, CIFAR-100, and Tiny ImageNet show that DQAS outperforms state-of-the-art dataset compression methods, especially at low sampling ratios, achieving strong performance with reduced data. The proposed approach offers a practical framework for imbalanced dataset compression and efficient data management in deep learning pipelines.

Abstract

Deep learning has made remarkable progress recently, largely due to the availability of large, well-labeled datasets. However, training on such datasets raises costs and computational demands. To address this, techniques such as coreset selection, dataset distillation, and dataset quantization have been explored in the literature. Unlike traditional techniques that depend on uniform sample distributions across classes, our research demonstrates that performance can be maintained even with uneven distributions. We find that for certain classes, variation in sample quantity has minimal impact on performance. Inspired by this observation, an intuitive idea is to reduce the number of samples for stable classes and increase the number for sensitive classes, achieving better performance at the same sampling ratio. The question then arises: how can we adaptively select samples from a dataset to achieve optimal performance? In this paper, we propose a novel active-learning-based adaptive sampling strategy, Dataset Quantization with Active Learning based Adaptive Sampling (DQAS), to optimize sample selection. In addition, we introduce a novel pipeline for dataset quantization that uses the feature space from the final stage of dataset quantization to generate more precise dataset bins. Our comprehensive evaluations on multiple datasets show that our approach outperforms state-of-the-art dataset compression methods.
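
The intuition in the abstract (fewer samples for stable classes, more for sensitive ones, at an unchanged overall sampling ratio) can be illustrated with a small budget-allocation sketch. This is not the DQAS algorithm, which selects samples through active learning. The function name `allocate_class_budgets`, the per-class `sensitivity` scores, and the `min_per_class` floor are all hypothetical and introduced only for illustration. The sketch assumes the sensitivity scores have already been estimated by some other means.

```python
from typing import Dict


def allocate_class_budgets(
    class_sizes: Dict[str, int],
    sensitivity: Dict[str, float],
    sampling_ratio: float,
    min_per_class: int = 1,
) -> Dict[str, int]:
    """Split a fixed sample budget across classes in proportion to sensitivity.

    Hypothetical helper: `sensitivity` is an assumed, externally supplied score
    (higher means accuracy depends more on sample count), not a DQAS quantity.
    """
    total_budget = round(sampling_ratio * sum(class_sizes.values()))

    # Start every class at a small floor so that no class disappears entirely.
    budgets = {c: min(min_per_class, n) for c, n in class_sizes.items()}
    remaining = total_budget - sum(budgets.values())
    if remaining < 0:
        raise ValueError("sampling_ratio is too small for the per-class floor")

    while remaining > 0:
        # Only classes that still have unselected samples can receive more.
        open_classes = [c for c in class_sizes if budgets[c] < class_sizes[c]]
        if not open_classes:
            break
        weight_sum = sum(sensitivity[c] for c in open_classes)
        if weight_sum <= 0:
            raise ValueError("sensitivity scores must have a positive sum")

        # Ideal (fractional) share of the remaining budget for each class.
        shares = {c: remaining * sensitivity[c] / weight_sum for c in open_classes}

        granted = 0
        for c in open_classes:
            take = min(int(shares[c]), class_sizes[c] - budgets[c])
            budgets[c] += take
            granted += take
        remaining -= granted

        if granted == 0:
            # Every share is below one sample: hand out the leftovers one by
            # one, starting with the largest fractional share.
            for c in sorted(open_classes, key=lambda k: shares[k], reverse=True):
                if remaining == 0:
                    break
                budgets[c] += 1
                remaining -= 1

    return budgets


if __name__ == "__main__":
    sizes = {"stable_a": 100, "stable_b": 100, "sensitive": 100}
    scores = {"stable_a": 1.0, "stable_b": 1.0, "sensitive": 2.0}
    print(allocate_class_budgets(sizes, scores, sampling_ratio=0.1))
```

With equal scores the allocation reduces to the uniform per-class budget that the abstract contrasts against. Raising one class's score shifts samples toward it while the total number of selected samples stays the same.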

Paper Structure (13 sections, 9 equations, 4 figures, 5 tables, 1 algorithm)

Figures (4)

  • Figure 1: Accuracy by category and sample fraction visualization on CIFAR-10. This figure illustrates the outcomes of applying Dataset Quantization (DQ) across various sample fractions, followed by an evaluation of model accuracy for each category. It reveals that not all classes benefit equally from an increase in the number of samples.
  • Figure 2: Comparison between the pipelines of DQAS and DQ. Red arrows mark the differences between our pipeline and DQ's. Our pipeline leverages the dataset features from the reconstructed data. In this way, the dataset features remain consistent before and after dropping patches, so dataset bin generation is not adversely affected by patch removal and yields a more precise output $\mathcal{S}'$.
  • Figure 3: Counts and accuracy by category comparison between DQ and DQAS.
  • Figure 4: Class-wise dataset initialization.