Harnessing small projectors and multiple views for efficient vision pretraining

Kumar Krishna Agrawal; Arna Ghosh; Shagun Sodhani; Adam Oberman; Blake Richards

TL;DR

This work builds on recent analytical results to design theoretically grounded, practical recommendations for competitive and efficient SSL, and empirically verifies them on the CIFAR, STL and ImageNet datasets.

Abstract

Recent progress in self-supervised learning (SSL) of visual representations has produced several frameworks that all rely on image augmentations but use different loss functions. However, there are few theoretically grounded principles to guide practice, so implementing each SSL framework requires several heuristics to achieve competitive performance. In this work, we build on recent analytical results to design practical recommendations for competitive and efficient SSL that are grounded in theory. Specifically, recent theory tells us that existing SSL frameworks minimize the same idealized loss, which is to learn features that best match the data similarity kernel defined by the augmentations used. We show how this idealized loss can be reformulated as a functionally equivalent loss that is more efficient to compute. We study the implicit bias of using gradient descent to minimize our reformulated loss function and find that a stronger orthogonalization constraint combined with a reduced projector dimensionality should yield good representations. Furthermore, the theory tells us that the approximation of the reformulated loss improves as the number of augmentations increases, so using multiple augmentations should lead to improved convergence. We empirically verify our findings on the CIFAR, STL and ImageNet datasets, where we demonstrate improved linear readout performance when training a ResNet backbone using our theoretically grounded recommendations. Remarkably, we also demonstrate that by leveraging these insights, we can reduce the pretraining dataset size by up to 2$\times$ while maintaining downstream accuracy simply by using more data augmentations. Taken together, our work provides theoretically grounded recommendations that can be used to improve SSL convergence and efficiency.
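As a rough illustration of the two ingredients the abstract describes (an invariance term across augmentations, and an orthogonalization constraint weighted by $\beta$ on a $d$-dimensional projector output), the following NumPy sketch computes a BarlowTwins-style objective over $m$ views. The function names, the use of the across-view mean as the invariance target, and the exact form of the penalty are assumptions made for illustration; this is not the paper's reformulated loss.

```python
import numpy as np


def standardize(z, eps=1e-8):
    """Zero-mean, unit-variance per feature dimension across the batch."""
    return (z - z.mean(axis=0)) / (z.std(axis=0) + eps)


def multi_view_loss(views, beta):
    """Hypothetical multi-augmentation objective (illustrative only).

    views: array of shape (m, n, d) holding projector outputs for
           m augmentations of n samples in d dimensions.
    beta:  weight of the orthogonality (decorrelation) penalty.
    """
    m, n, d = views.shape
    z = np.stack([standardize(v) for v in views])  # (m, n, d)

    # Invariance: every view's embedding should match the across-view mean.
    mean_z = z.mean(axis=0)                        # (n, d)
    invariance = np.mean(np.sum((z - mean_z) ** 2, axis=-1))

    # Orthogonality: per-view feature correlation matrix should be near identity.
    ortho = 0.0
    for zk in z:
        corr = zk.T @ zk / n                       # (d, d)
        ortho += np.sum((corr - np.eye(d)) ** 2)
    ortho /= m

    return invariance + beta * ortho


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(64, 8))
    # Four noisy "augmentations" of the same batch of embeddings.
    views = np.stack([base + 0.1 * rng.normal(size=base.shape) for _ in range(4)])
    print(multi_view_loss(views, beta=1.0))
```

In this sketch, a smaller `d` shrinks the correlation matrix the penalty acts on, while a larger `beta` enforces decorrelation more strongly; how the two should be traded off is what the paper's analysis and experiments address.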

Paper Structure (30 sections, 7 theorems, 55 equations, 21 figures, 8 tables)

Introduction
Preliminaries
Implicit bias of non-contrastive SSL loss and optimization
Features in terms of data augmentation kernels
The implicit bias of gradient descent
Takeaway 1: Low-dimensional projectors can yield good representations
Takeaway 2: Multiple augmentations improve kernel approximation
Experiments
Low-dimensional projectors can yield good representations
Multiple Augmentations Improve Performance and Convergence
Sample Efficient Multi-augmentation Learning
Related Work
Discussion
Hilbert Space of functions
Functions and inner product space
...and 15 more sections

Key Result

Theorem 3.1

Let $G(x)$ be the infinite Mercer features of the backward data augmentation covariance kernels, $k^{DAB}$. Let $F(x) = (f_1(x), \dots, f_{N_k}(x))$ be the features given by minimizing the data augmentation invariance loss, which includes the orthogonality constraint. Then $V(F) \subset V(G)$ and $\lim_{N_k\to\infty} V(F) = V(G)$.
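A finite-dimensional analogy may help build intuition for the subspace statement. The sketch below assumes a stand-in kernel matrix `K` (random, low rank, positive semidefinite) and assumes the invariance loss takes the form $\mathrm{tr}(F^\top (I - K) F)$ subject to $F^\top F = I$, whose minimizers are the top eigenvectors of `K`. Both assumptions, and the helper names `top_eigvecs` and `span_residual`, are illustrative and not taken from the paper. Under those assumptions, the learned features always lie in the span of the kernel's eigenvectors, and the two spans coincide once the number of features reaches the kernel's rank.

```python
import numpy as np


def top_eigvecs(K, k):
    """Top-k eigenvectors of a symmetric matrix K.

    These minimize tr(F^T (I - K) F) subject to F^T F = I.
    """
    w, v = np.linalg.eigh(K)            # eigenvalues in ascending order
    order = np.argsort(w)[::-1]         # re-sort to descending
    return v[:, order[:k]]


def span_residual(F, G):
    """Frobenius norm of the part of F lying outside the column span of G."""
    Q, _ = np.linalg.qr(G)              # orthonormal basis for span(G)
    return np.linalg.norm(F - Q @ (Q.T @ F))


rng = np.random.default_rng(0)
n, r = 50, 10
A = rng.normal(size=(n, r))
K = A @ A.T / n                         # rank-r PSD stand-in for the kernel
G = top_eigvecs(K, r)                   # finite analogue of the Mercer features

if __name__ == "__main__":
    for k in (2, 5, 10):
        F = top_eigvecs(K, k)
        # First value: how much of F escapes span(G); second: how much of G escapes span(F).
        print(k, span_residual(F, G), span_residual(G, F))
```

The first residual stays at numerical zero for every `k`, while the second shrinks to zero only when `k` reaches the rank `r`, mirroring the containment and the limit in the theorem.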

Figures (21)

Figure 1: Design of existing SSL algorithms relies on heuristics. (A) Augmentation graphs are common in vision pretraining, providing generalizable features for downstream tasks. (B) We propose an equivalent loss function for SSL pretraining that recovers the same eigenfunctions more efficiently than existing approaches.
Figure 2: Low-dimensional projectors can yield good representations. We demonstrate that pairing a lower projector dimensionality with a stronger orthogonality constraint, $\beta$, achieves similar performance over a wide range of projector dimensions ($d$).
Figure 3: Using multiple augmentations improves representation learning performance and convergence. (A-C) For BarlowTwins pretraining on CIFAR-10, STL-10 and ImageNet-100, using 4 augmentations instead of 2 improves performance. See the full-dataset pretraining section (sec:full_dset_pretraining) for more results.
Figure 4: Multi-augmentation improves sample efficiency, recovering similar performance with significantly fewer unique samples in the pretraining dataset. (A-C) For BarlowTwins pretraining on CIFAR-10, STL-10 and ImageNet-100 at the same effective dataset size ($\#augs \times \#unique\_samples$), using more patches improves performance at the same epoch. However, a tradeoff exists: more data augmentations fail to improve performance in the scarce-data regime.
Figure 5: Using $>2$ augmentations with a fraction of the dataset improves the overall Pareto frontier, reducing runtime by up to $\sim2\times$.
...and 16 more figures
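
The effective dataset size used in the Figure 4 caption, $\#augs \times \#unique\_samples$, can be made concrete with a small calculation. The sketch below assumes the standard CIFAR-10 training split of 50,000 images as the baseline; the helper name and the specific augmentation counts are illustrative and are not the paper's experimental configurations.

```python
def unique_samples_needed(effective_size: int, num_augs: int) -> int:
    """Unique images needed so that num_augs * unique_samples == effective_size."""
    if num_augs < 1:
        raise ValueError("num_augs must be at least 1")
    return effective_size // num_augs


# Baseline: 2 augmentations of each of the 50,000 CIFAR-10 training images.
BASELINE_EFFECTIVE_SIZE = 2 * 50_000

if __name__ == "__main__":
    for num_augs in (2, 4, 8):
        n_unique = unique_samples_needed(BASELINE_EFFECTIVE_SIZE, num_augs)
        print(f"{num_augs} augmentations -> {n_unique} unique images")
```

Holding the product fixed, doubling the number of augmentations halves the number of unique images, which is the regime in which the caption reports the sample-efficiency gains and, at the low-data extreme, the tradeoff.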

Theorems & Definitions (17)

Definition 2.1
Theorem 3.1
Theorem 3.2
Definition A.1
Definition A.2: Spectral Operator
Theorem A.3: Spectral Decomposition
Remark A.4
Definition A.5: Compact Operators
Definition A.6: Hilbert-Schmidt Operators
Definition A.7
...and 7 more
