Dataset Color Quantization: A Training-Oriented Framework for Dataset-Level Compression

Chenyue Yu, Lingao Xiao, Jinhong Deng, Ivor W. Tsang, Yang He

TL;DR

Extensive experiments show that DCQ significantly improves training performance under aggressive compression, offering a scalable and robust solution for dataset-level storage reduction.

Abstract

Large-scale image datasets are fundamental to deep learning, but their high storage demands pose challenges for deployment in resource-constrained environments. While existing approaches reduce dataset size by discarding samples, they often ignore the significant redundancy within each image, particularly in the color space. To address this, we propose Dataset Color Quantization (DCQ), a unified framework that compresses visual datasets by reducing color-space redundancy while preserving information crucial for model training. DCQ achieves this by enforcing consistent palette representations across similar images, selectively retaining semantically important colors guided by model perception, and maintaining structural details necessary for effective feature learning. Extensive experiments across CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that DCQ significantly improves training performance under aggressive compression, offering a scalable and robust solution for dataset-level storage reduction. Code is available at https://github.com/he-y/Dataset-Color-Quantization.

Paper Structure

This paper contains 31 sections, 8 equations, 10 figures, 20 tables.

Figures (10)

  • Figure 1: Visualization of different color quantization algorithms on Tiny-ImageNet. Images are quantized to 4 colors, i.e., 2 bits per pixel. (a) Original images. (b) Per-image K-means clustering yields a representative color palette for each image but wastes bits on backgrounds. (c) ColorCNN yields independent, representative color palettes, but the quantized images show abrupt textural discontinuities. (d) Our DCQ assigns more colors to foregrounds and produces less textural discontinuity.
  • Figure 2: Comparison of the original image and feature maps extracted from a ResNet-18 trained on CIFAR-10. Thermal color maps visualize activation strength from black to white, reflecting learned feature hierarchy.
  • Figure 3: The pipeline of our dataset color quantization framework. First, we apply K-means clustering to group images based on their features extracted from a pre-trained model. Next, within each cluster, we perform K-means on the color palettes of individual images to generate a shared color palette for all images in the same cluster. The generated palettes and their corresponding indices are stored for later use. During training, we retrieve the stored indices and palettes to reconstruct quantized images, which are then used to train a neural network.
  • Figure 4: Visualization of four clusters obtained by applying ResNet-18 first-block features to partition CIFAR-10 into 20 clusters.
  • Figure 5: Comparison between prior color quantization methods and our approach, evaluated with ResNet-18. Detailed results are provided in the appendix. Unlike the original paper, which quantized the test set while training on the original training set, we quantize the entire training set and keep the test set unchanged.
  • ...and 5 more figures
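
The pipeline described in the Figure 3 caption can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the helper names (kmeans, nearest, quantize_dataset, reconstruct, bits_per_index) are hypothetical, the mean color of each image stands in for features from a pre-trained model, and the toy images are synthetic.

```python
import math
import random

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means over a list of equal-length numeric tuples."""
    k = min(k, len(points))
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers

def quantize_dataset(images, features, n_clusters, n_colors):
    """Cluster images by feature, then build one shared palette per cluster."""
    feature_centers = kmeans(features, n_clusters)
    cluster_ids = [nearest(f, feature_centers) for f in features]
    palettes, indices = {}, [None] * len(images)
    for cid in sorted(set(cluster_ids)):
        members = [i for i, c in enumerate(cluster_ids) if c == cid]
        # Per-image palettes are pooled, then clustered into a shared palette.
        pooled = [color for i in members for color in kmeans(images[i], n_colors)]
        palettes[cid] = kmeans(pooled, n_colors)
        for i in members:
            indices[i] = [nearest(px, palettes[cid]) for px in images[i]]
    return cluster_ids, palettes, indices

def reconstruct(i, cluster_ids, palettes, indices):
    """Rebuild quantized image i from its stored indices and cluster palette."""
    palette = palettes[cluster_ids[i]]
    return [palette[j] for j in indices[i]]

def bits_per_index(n_colors):
    """Bits needed to store one palette index."""
    return max(1, math.ceil(math.log2(n_colors)))

# Synthetic stand-in data: each image is a flat list of (r, g, b) pixels.
rng = random.Random(1)

def toy_image(base, n_pixels=16):
    return [tuple(min(255, max(0, ch + rng.randint(-20, 20))) for ch in base)
            for _ in range(n_pixels)]

images = [toy_image((200, 40, 40)), toy_image((210, 50, 30)),
          toy_image((30, 40, 200)), toy_image((40, 30, 210))]
# Assumption: mean color replaces pre-trained model features in this sketch.
features = [tuple(sum(axis) / len(img) for axis in zip(*img)) for img in images]

cluster_ids, palettes, indices = quantize_dataset(images, features, n_clusters=2, n_colors=4)
restored = [reconstruct(i, cluster_ids, palettes, indices) for i in range(len(images))]
```

In this sketch only the per-pixel indices and one palette per cluster need to be stored. With 4 colors, bits_per_index(4) gives 2 bits per pixel, compared with 24 bits for a pixel stored as 8-bit RGB, before counting the small per-cluster palette overhead.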

Theorems & Definitions (2)

  • Definition 1: Color Space Distribution
  • Definition 2: Color Palette Impact Level