
Harnessing small projectors and multiple views for efficient vision pretraining

Kumar Krishna Agrawal, Arna Ghosh, Shagun Sodhani, Adam Oberman, Blake Richards

TL;DR

This work builds on recent analytical results to design theoretically grounded, practical recommendations for competitive and efficient SSL, and empirically verifies them on the CIFAR, STL and ImageNet datasets.

Abstract

Recent progress in self-supervised learning (SSL) of visual representations has produced several frameworks that rely on image augmentations but use different loss functions. However, there are few theoretically grounded principles to guide practice, so implementing each SSL framework requires several heuristics to achieve competitive performance. In this work, we build on recent analytical results to design practical recommendations for competitive and efficient SSL that are grounded in theory. Specifically, recent theory tells us that existing SSL frameworks minimize the same idealized loss, which is to learn features that best match the data similarity kernel defined by the augmentations used. We show how this idealized loss can be reformulated into a functionally equivalent loss that is more efficient to compute. We study the implicit bias of using gradient descent to minimize our reformulated loss function and find that using a stronger orthogonalization constraint with a reduced projector dimensionality should yield good representations. Furthermore, the theory tells us that the approximation of the reformulated loss should improve as the number of augmentations increases, and as such using multiple augmentations should lead to improved convergence. We empirically verify our findings on the CIFAR, STL and ImageNet datasets, where we demonstrate improved linear readout performance when training a ResNet backbone using our theoretically grounded recommendations. Remarkably, we also demonstrate that by leveraging these insights, we can reduce the pretraining dataset size by up to 2$\times$ while maintaining downstream accuracy simply by using more data augmentations. Taken together, our work provides theoretically grounded recommendations that can be used to improve SSL convergence and efficiency.

Paper Structure

This paper contains 30 sections, 7 theorems, 55 equations, 21 figures, 8 tables.

Key Result

Theorem 3.1

Let $G(x)$ be the infinite Mercer features of the backward data augmentation covariance kernels, $k^{DAB}$. Let $F(x) = (f_1(x), \dots, f_{N_k}(x))$ be the features given by minimizing the data augmentation invariance loss, which includes the orthogonality constraint. Then $V(F) \subset V(G)$ and $\lim_{N_k\to\infty} V(F) = V(G)$.
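
The loss referenced in Theorem 3.1 is not reproduced on this page. As a rough illustration of the ingredients the abstract names (an invariance term computed over multiple augmentations, an orthogonality constraint weighted by $\beta$, and a projector of dimension $d$), a minimal NumPy sketch might look like the following. The function name `multi_view_loss`, the per-feature standardization, and the exact form of both terms are assumptions made for illustration, not the paper's formulation.

```python
import numpy as np


def multi_view_loss(views, beta=1.0, eps=1e-8):
    """Hypothetical multi-augmentation invariance + orthogonality loss.

    views: array of shape (m, n, d) holding projector outputs for
           m augmentations of the same n samples, each of dimension d.
    beta:  weight on the orthogonality (decorrelation) penalty.
    Returns (total, invariance, orthogonality).
    """
    m, n, d = views.shape

    # Standardize every feature across the batch, separately for each view.
    mean = views.mean(axis=1, keepdims=True)
    std = views.std(axis=1, keepdims=True)
    z = (views - mean) / (std + eps)

    # Invariance: embeddings of different augmentations of one sample
    # should agree, so penalize their spread around the per-sample mean.
    center = z.mean(axis=0, keepdims=True)
    invariance = np.mean(np.sum((z - center) ** 2, axis=2))

    # Orthogonality: average the d x d feature correlation matrix over
    # views and penalize its off-diagonal entries.
    corr = np.einsum("vni,vnj->ij", z, z) / (m * n)
    off_diag = corr - np.diag(np.diag(corr))
    orthogonality = np.sum(off_diag ** 2)

    return invariance + beta * orthogonality, invariance, orthogonality


# Illustrative call: 4 augmentations, batch of 256, projector dimension 16.
rng = np.random.default_rng(0)
total, inv, orth = multi_view_loss(rng.normal(size=(4, 256, 16)), beta=5.0)
```

In this sketch, raising `beta` while shrinking `d` corresponds to the "stronger orthogonalization constraint with a reduced projector dimensionality" regime described in the abstract, and increasing `m` averages both terms over more augmentations.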

Figures (21)

  • Figure 1: Design of existing SSL algorithms relies on heuristics. (A) Augmentation graphs are common in vision pretraining, providing generalizable features for downstream tasks. (B) We propose an equivalent loss function for SSL pretraining that recovers the same eigenfunctions more efficiently than existing approaches.
  • Figure 2: Low-dimensional projectors can yield good representations. We demonstrate that using a stronger orthogonality constraint, $\beta$, at lower projector dimensionality achieves similar performance across a wide range of projector dimensions ($d$).
  • Figure 3: Using multiple augmentations improves representation learning performance and convergence. (A-C) Across BarlowTwins pretraining on CIFAR-10, STL-10 and ImageNet-100, using 4 augmentations instead of 2 helps improve performance. Please see Section \ref{sec:full_dset_pretraining} for more results.
  • Figure 4: Multi-augmentation improves sample efficiency, recovering similar performance with significantly fewer unique samples in the pretraining dataset. Across BarlowTwins pretraining on CIFAR-10, STL-10 and ImageNet-100 for the same effective dataset size ($\#augs \times \#unique\_samples$), using more patches improves performance at the same epoch (A-C). However, there is a tradeoff: more data augmentations fail to improve performance in the scarce data regime.
  • Figure 5: Using $>2$ augmentations with a fraction of the dataset improves the overall Pareto frontier, speeding up runtime by up to $\sim2\times$.
  • ...and 16 more figures
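
The "effective dataset size" used in the Figure 4 caption is the product $\#augs \times \#unique\_samples$. The short sketch below works through that arithmetic; the helper name `effective_dataset_size` and the sample count of 50,000 are illustrative assumptions, not values taken from the paper.

```python
def effective_dataset_size(num_augs: int, num_unique_samples: int) -> int:
    """Effective dataset size = #augs x #unique_samples."""
    return num_augs * num_unique_samples


full = 50_000  # illustrative number of unique pretraining images (assumed)

# Two augmentations over the full dataset versus four augmentations over
# half of it: both settings give the same effective dataset size.
baseline = effective_dataset_size(2, full)
reduced = effective_dataset_size(4, full // 2)
print(baseline, reduced)  # prints "100000 100000"
```

Holding this product fixed is what makes the comparison between more augmentations and more unique samples a like-for-like one.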

Theorems & Definitions (17)

  • Definition 2.1
  • Theorem 3.1
  • Theorem 3.2
  • Definition A.1
  • Definition A.2: Spectral Operator
  • Theorem A.3: Spectral Decomposition
  • Remark A.4
  • Definition A.5: Compact Operators
  • Definition A.6: Hilbert-Schmidt Operators
  • Definition A.7
  • ...and 7 more