Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least

Siddharth Joshi, Baharan Mirzasoleiman

TL;DR

The work tackles data-efficient contrastive self-supervised learning by identifying examples that most contribute to learning robust representations. It proves that the most beneficial examples are those with high expected augmentation similarity within their latent class, and develops SAS to select subsets that preserve alignment and class-center divergence, with rigorous generalization guarantees. SAS uses a non-monotone submodular optimization framework, solved efficiently by greedy and double-greedy procedures, and relies on proxy models or latent-class approximations to estimate augmentation similarity without labels. Empirically, SAS enables discarding 20% of CIFAR100 data and 40% of STL10/TinyImageNet without harming, and often improving, downstream performance across multiple SSL methods (SimCLR, BYOL, SimSiam, MoCo). The findings also show that the most SSL-beneficial examples are the least beneficial for supervised learning, offering practical guidance for data curation in SSL pipelines.

Abstract

Self-supervised learning (SSL) learns high-quality representations from large pools of unlabeled training data. As datasets grow larger, it becomes crucial to identify the examples that contribute the most to learning such representations. This enables efficient SSL by reducing the volume of data required. Nevertheless, quantifying the value of examples for SSL has remained an open question. In this work, we address this problem for the first time, by proving that examples that contribute the most to contrastive SSL are those that have the most similar augmentations to other examples, in expectation. We provide rigorous guarantees for the generalization performance of contrastive learning on such subsets. Through extensive experiments, we show that we can safely exclude 20% of examples from CIFAR100 and 40% from STL10 and TinyImageNet, without affecting downstream task performance. In general, subsets selected by our method outperform random subsets by over 3% across these datasets. Interestingly, we also discover the subsets that contribute the most to contrastive learning are those that contribute the least to supervised learning. Code available at https://github.com/bigml-cs-ucla/sas-data-efficient-contrastive-learning.
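
The selection idea described above can be illustrated with a minimal sketch. This is not the authors' SAS implementation (see the linked repository for that): it is a simplified per-class ranking, and it assumes that augmentation similarity is approximated by cosine similarity between embeddings from some proxy model and that an approximate latent-class assignment (for example, from clustering) is already available. The function name `select_by_augmentation_similarity` and its parameters are hypothetical.

```python
import numpy as np


def select_by_augmentation_similarity(embeddings, latent_classes, keep_fraction):
    """Keep, within each latent class, the examples most similar to the rest.

    embeddings:     (n, d) array of proxy-model embeddings (assumed stand-in
                    for expected augmentation similarity).
    latent_classes: (n,) array of approximate latent-class ids (assumption).
    keep_fraction:  fraction of each latent class to keep, in (0, 1].
    Returns the sorted indices of the selected examples.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    latent_classes = np.asarray(latent_classes)

    # Normalize rows so that dot products are cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    selected = []
    for c in np.unique(latent_classes):
        idx = np.flatnonzero(latent_classes == c)
        sim = z[idx] @ z[idx].T          # pairwise similarity inside the class
        np.fill_diagonal(sim, 0.0)       # ignore self-similarity
        score = sim.sum(axis=1)          # total similarity to class-mates
        n_keep = max(1, int(round(keep_fraction * len(idx))))
        top = idx[np.argsort(-score)[:n_keep]]
        selected.extend(top.tolist())
    return np.array(sorted(selected))


if __name__ == "__main__":
    # Toy data: two latent classes, each with two similar points and one outlier.
    emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
                    [-1.0, 0.0], [-1.0, 0.1], [0.0, -1.0]])
    classes = np.array([0, 0, 0, 1, 1, 1])
    print(select_by_augmentation_similarity(emb, classes, keep_fraction=2 / 3))
```

Ranking by a summed similarity score is the simplest possible variant. The summary above describes SAS as a submodular optimization solved with greedy procedures, which this sketch does not reproduce.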

Paper Structure

This paper contains 20 sections, 3 theorems, 36 equations, 8 figures, 1 algorithm.

Key Result

Theorem 4.1

For any $l, k \in [K]$, if [condition missing in the source], then the downstream error rate of the NN classifier is [bound missing in the source].

Figures (8)

  • Figure 1: Visualizing Expected Augmentation Distance $d_{x, x'}$. The pair of examples on the left is semantically very similar, as seen from their augmentations being very similar to each other, so the expected augmentation distance between them is small. In contrast, the pair of examples on the right is not as semantically similar, so their augmentations are very dissimilar and the expected augmentation distance between them is large.
  • Figure 2: Most representative examples: the examples in the top row are each representative of their group (e.g., breed) in the class dog.
  • Figure 3: Downstream Classification Accuracy of SAS Subsets vs. Random Subsets (reporting mean and std over 3 runs).
  • Figure 4: Evaluating SAS on other contrastive learning methods (training a ResNet-18).
  • Figure 5: Ablation study on CIFAR100 and STL10.
  • ...and 3 more figures

Theorems & Definitions (5)

  • Theorem 4.1: huang2021towards
  • Definition 4.2: Expected augmentation distance
  • Theorem 4.3
  • proof
  • Theorem 2.1: Complete version of Theorem 4.1 (huang2021towards)
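
The expected augmentation distance named in Definition 4.2 and illustrated in Figure 1 can be written down as follows. This is a sketch under stated assumptions, not a quotation of the paper's definition: $A(x)$ is assumed to denote the set (or distribution) of augmentations of an example $x$, and the norm is assumed to be the $\ell_2$ norm.

```latex
% Assumed notation: A(x) = augmentations of x; \|\cdot\| = \ell_2 norm.
d_{x, x'} \;=\; \mathbb{E}_{x_1 \in A(x),\; x_2 \in A(x')} \, \lVert x_1 - x_2 \rVert
```

Under this reading, a small $d_{x, x'}$ means that augmented views of $x$ and $x'$ tend to lie close together, which matches the left-hand pair in Figure 1.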