Semantic Redundancies in Image-Classification Datasets: The 10% You Don't Need

Vighnesh Birodkar, Hossein Mobahi, Samy Bengio

TL;DR

This work identifies and exploits semantic redundancies in large image-classification datasets by clustering samples in a latent semantic space learned from the full data. Using class-wise agglomerative clustering and cosine-based dissimilarity, it retains one representative per cluster to form a reduced training subset. Across CIFAR-10 and ImageNet, removing at least 10% of data via this semantic clustering does not degrade test/validation accuracy, while CIFAR-100 shows limited redundancy. The results challenge the view that these datasets are entirely data-hungry and point to dataset-specific opportunities for data-efficiency and more informed data collection.
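The selection step summarized above can be sketched in a few lines of pure Python. This is a minimal illustration, not the authors' implementation. It assumes feature vectors have already been extracted from a model trained on the full dataset. It also assumes complete linkage between clusters and keeps the lowest-index member of each cluster as the representative; both choices are assumptions made for the example. The function names (`cosine_dissimilarity`, `agglomerate`, `select_subset`) are hypothetical.

```python
import math


def cosine_dissimilarity(u, v):
    """Return 1 - cos(u, v) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerate(features, n_clusters):
    """Naive agglomerative clustering; returns clusters as lists of local indices.

    Assumption: complete linkage, i.e. the dissimilarity between two clusters
    is the largest pairwise dissimilarity between their members.
    """
    n = len(features)
    dist = [[cosine_dissimilarity(features[i], features[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None  # (dissimilarity, position of cluster x, position of cluster y)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # Merge the two closest clusters.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def select_subset(features, labels, keep_fraction):
    """Cluster each class separately and keep one sample index per cluster."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    kept = []
    for indices in by_class.values():
        n_keep = max(1, round(keep_fraction * len(indices)))
        class_features = [features[i] for i in indices]
        for cluster in agglomerate(class_features, n_keep):
            # Assumption: the lowest-index member represents the cluster.
            kept.append(indices[min(cluster)])
    return sorted(kept)


# Toy usage: six 2-D "features", two classes, keep two thirds of each class.
toy_features = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0],
                [1.0, 1.0], [0.0, 1.0], [0.02, 1.0]]
toy_labels = ["plane", "plane", "plane", "truck", "truck", "truck"]
toy_kept = select_subset(toy_features, toy_labels, 2 / 3)
```

The naive merge loop is far too slow for datasets of CIFAR or ImageNet size; at that scale an optimized hierarchical-clustering routine from a numerical library would be needed. Clustering within each class keeps the class balance of the reduced subset close to that of the full set.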

Abstract

Large datasets have been crucial to the success of deep learning models in recent years, and these models keep performing better as they are trained with more labelled data. While there have been sustained efforts to make these models more data-efficient, the potential benefit of understanding the data itself is largely untapped. Specifically, focusing on object recognition tasks, we ask whether, for common benchmark datasets, we can do better than random subsets of the data and find a subset that, when trained on, generalizes on par with the full dataset. To our knowledge, this is the first result to find notable redundancies (at least 10%) in the CIFAR-10 and ImageNet datasets. Interestingly, we observe semantic correlations between required and redundant images. We hope that our findings can motivate further research into identifying additional redundancies and exploiting them for more efficient training or data collection.

Paper Structure

This paper contains 15 sections, 2 equations, 24 figures, 1 table.

Figures (24)

  • Figure 1: Examples of different redundant groups of images from the ImageNet dataset while creating a subset 90% of the size of the full set. In each group, we list the semantic variation considered redundant. The images selected by semantic clustering are highlighted with a green box whereas the rest are discarded with no negative impact on generalization.
  • Figure 2: Performance of subsets of varying size on the CIFAR-10 dataset. Each point is an average across $10$ trials and the vertical bars denote standard deviation. We see no drop in test accuracy until 10% of the data considered redundant by semantic clustering is removed.
  • Figure 3: Examples of redundant images in the CIFAR-10 dataset when creating a subset 90% of the size of the original set. The figure illustrates similarity between images within each redundant group and variation across different redundant groups. Two of the panels show different redundant groups of the class Airplane, and two show different redundant groups of the class Truck. In each group, only the images marked with green boxes are kept and the rest are discarded. The discarded images did not lower test accuracy.
  • Figure 4: Number of redundant groups of various sizes in the CIFAR-10 dataset when finding a 90% subset for two classes. Note that the y-axis is logarithmic.
  • Figure 5: Performance of subsets of varying size on the CIFAR-100 dataset. Each point is an average over $10$ trials and the vertical bars denote standard deviation.
  • ...and 19 more figures