Seeing the Whole in the Parts in Self-Supervised Representation Learning

Arthur Aubret; Céline Teulière; Jochen Triesch

TL;DR

This work presents CO-SSL, a family of instance discrimination methods, and shows that it outperforms previous methods on several datasets, including ImageNet-1K, where it achieves 71.5% Top-1 accuracy with 100 pre-training epochs.

Abstract

Recent successes in self-supervised learning (SSL) model spatial co-occurrences of visual features either by masking portions of an image or by aggressively cropping it. Here, we propose a new way to model spatial co-occurrences by aligning local representations (before pooling) with a global image representation. We present CO-SSL, a family of instance discrimination methods, and show that it outperforms previous methods on several datasets, including ImageNet-1K, where it achieves 71.5% Top-1 accuracy with 100 pre-training epochs. CO-SSL is also more robust to noise corruption, internal corruption, small adversarial attacks, and large training crop sizes. Our analysis further indicates that CO-SSL learns highly redundant local representations, which offers an explanation for its robustness. Overall, our work suggests that aligning local and global representations may be a powerful principle of unsupervised category learning.

Paper Structure (31 sections, 4 equations, 6 figures, 13 tables)

Introduction
Related works
Self-supervised ID learning.
Patch representations.
Spatial statistical learning.
Intra-network SSL.
Method
CO-SSL: Co-Occurrence Self-Supervised Learning
CO-SSL in general.
CO-BYOL in detail.
RF-ResNet family
Training and evaluation
Experiments
CO-BYOL outperforms previous methods on downstream category recognition
CO-SSL (ResNet50) is more robust to corruptions and adversarial attacks
...and 16 more sections

Figures (6)

Figure 1: Architecture of CO-BYOL. We augment an image through two augmentation pipelines and forward one view to the visual encoder $f_{\theta}$ and the other to the momentum encoder $f_{\xi}$. Both output a set of local representations. Then, we spatially average these local representations into two global representations, which are fed into MLPs computing projections and predictions for the standard loss function $\mathcal{L}_{g}$ of BYOL. In CO-BYOL, we also individually compute local embeddings of an image with two new local projectors and a new local predictor. Finally, we symmetrically compute the averaged BYOL loss $\mathcal{L}_{l}$ between each local embedding and the global embedding of the other image.
Figure 2: RF-ResNet architecture. We omit residual connections for better readability. A) Overview of the architecture as a succession of convolution blocks. $n$ is the number of blocks in each of the four layers, as in a standard ResNet; $m$ denotes the absence/presence of the MaxPool layer; and $s'$, $s''$, $s'''$ denote the values of the stride parameters of the first block of a layer (cf. B). B) Zoom in on the two different types of blocks, each of which is a stack of convolution layers. C) Examples of RF sizes for four different RF-ResNets; the RF size is independent of the number of blocks $n > 2$ in each layer.
Figure 3: A) Top-1 ImageNet-100 validation accuracy. We train different RF-ResNet50 variants with different minimum crop ratios $c_{min} \in \{0.1, 0.2, 0.3, 0.4, 0.7\}$. Because an RF-ResNet cannot, by design, reach an RF of size $33\times33$, we use a parameter-matched BagNet33 [brendel2018approximating] for that size. BagNets are similar to RF-ResNets, but are defined for smaller ranges of RFs. For $\text{size(RF)}=425\times425$, we use a ResNet50. We also show BYOL with a ResNet50 as a reference baseline. B) Correlation between the cosine similarity between global representations and the cosine similarity between local representations on the ImageNet-1K validation set. "C" denotes the Pearson correlation, and vertical bars show the cosine similarity between intra-image local representations.
Figure 4: Visualization of the effective receptive fields of four local representations computed on two validation images for two methods. For diversity, we select four local representations with normalized coordinates in the feature maps (0,0), (0,0.5), (0.5,0) and (0.5,0.5), from left to right.
Figure 5: Top-1 ImageNet-100 validation accuracy. This is the same data as shown in Figure 3, but we plot the accuracy against the minimum crop size.
...and 1 more figures
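
The local-to-global alignment described in the Figure 1 caption can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation: it assumes the BYOL loss takes its usual form, $2 - 2\cos(p, z)$, and it treats all inputs as embeddings that have already passed through the projector and predictor MLPs. The function names, the use of plain lists as vectors, and the choice to sum the two symmetric terms are illustrative assumptions; encoders, momentum updates, and stop-gradients are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def byol_loss(prediction, target):
    """Usual BYOL regression loss: 2 - 2 * cos(prediction, target)."""
    return 2.0 - 2.0 * cosine(prediction, target)


def spatial_average(local_feats):
    """Average a list of local feature vectors into one global vector."""
    n = len(local_feats)
    return [sum(column) / n for column in zip(*local_feats)]


def local_to_global_loss(local_preds, global_target):
    """Mean BYOL loss between each local embedding and one global embedding."""
    losses = [byol_loss(p, global_target) for p in local_preds]
    return sum(losses) / len(losses)


def symmetric_local_loss(local_preds_a, global_target_b,
                         local_preds_b, global_target_a):
    """Symmetric variant: locals of view A vs. global of view B, and vice versa.

    Assumption: the two directions are summed; the paper may weight or
    average them differently.
    """
    return (local_to_global_loss(local_preds_a, global_target_b)
            + local_to_global_loss(local_preds_b, global_target_a))


if __name__ == "__main__":
    # Hypothetical 2x2 feature map flattened into four 3-d local vectors.
    view_a = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.9, 0.0, 0.1], [1.0, 0.1, 0.0]]
    view_b = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.0], [0.7, 0.3, 0.0], [0.8, 0.0, 0.2]]
    global_a = spatial_average(view_a)
    global_b = spatial_average(view_b)
    print(symmetric_local_loss(view_a, global_b, view_b, global_a))
```

In this toy setting the loss is near zero when every local vector points in roughly the same direction as the other view's global vector, which is consistent with the abstract's observation that the learned local representations are highly redundant.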
