Self-Supervised Learning with a Multi-Task Latent Space Objective

Pierre-François De Plaen, Abhishek Jha, Luc Van Gool, Tinne Tuytelaars, Marc Proesmans

TL;DR

The paper tackles the instability that arises in predictor-based Siamese self-supervised learning when multi-crop augmentations are used. It stabilizes training by assigning a dedicated predictor to each view type while sharing the encoder, which lets the model exploit both global and local crops. It further extends the framework with asymmetric cutout views to create a simple multi-task latent-space objective. Empirically, the approach yields consistent gains across BYOL, SimSiam, and MoCo v3 on ResNet and ViT backbones, achieving ImageNet results competitive with the state of the art and improved transfer to dense tasks such as COCO. The findings highlight the primacy of spatial augmentations in SSL and open avenues for view-conditioned predictive architectures and extensions to other modalities.

Abstract

Self-supervised learning (SSL) methods based on Siamese networks learn visual representations by aligning different views of the same image. The multi-crop strategy, which adds small local crops to the global ones, enhances many SSL frameworks but causes instability in predictor-based architectures such as BYOL, SimSiam, and MoCo v3. We trace this failure to the shared predictor used across all views and demonstrate that assigning a separate predictor to each view type stabilizes multi-crop training, resulting in significant performance gains. Extending this idea, we treat each spatial transformation as a distinct alignment task and add cutout views, where part of the image is masked before encoding. This yields a simple multi-task formulation of asymmetric Siamese SSL that combines global, local, and masked views into a single framework. The approach is stable, generally applicable across backbones, and consistently improves the performance of ResNet and ViT models on ImageNet.
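
The core idea, a shared encoder with one predictor per view type, can be illustrated with a forward-pass sketch. This is not the authors' implementation: the linear "encoder", the feature sizes, the negative cosine loss (the form commonly used in BYOL and SimSiam), the choice of a global view as the alignment target, and all names (`predictors`, `multi_task_loss`) are assumptions made for illustration. All views are given the same input size for simplicity, and gradients and the stop-gradient or momentum branch are not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
IN_DIM, DIM = 32, 8  # hypothetical input and embedding sizes


def make_linear(d_in, d_out):
    """Random linear map standing in for a trained network."""
    return rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))


# One shared encoder (stands in for backbone + projection head).
encoder = make_linear(IN_DIM, DIM)

# One predictor per view type, instead of a single shared predictor.
predictors = {name: make_linear(DIM, DIM) for name in ("global", "local", "cutout")}


def neg_cosine(p, z):
    """Mean negative cosine similarity between predictions p and targets z."""
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    return float(-(p * z).sum(axis=-1).mean())


def multi_task_loss(views, target):
    """views: dict view_type -> (batch, IN_DIM) array; target: (batch, IN_DIM)."""
    z_target = target @ encoder  # treated as a constant target here
    losses = {}
    for view_type, x in views.items():
        z = x @ encoder                   # shared encoder for every view
        p = z @ predictors[view_type]     # view-specific predictor
        losses[view_type] = neg_cosine(p, z_target)
    # Each view type is one alignment task; average them into one objective.
    return sum(losses.values()) / len(losses), losses
```

Calling `multi_task_loss` with a dictionary holding one batch per view type returns the averaged objective together with the per-task losses, which makes it easy to monitor each alignment task separately.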

Paper Structure (42 sections, 1 equation, 3 figures, 9 tables)

Figures (3)

  • Figure 1: Overview of our predictor-based Siamese SSL framework. Naive multi-crop (left) is unstable under a shared predictor. Using one predictor per view type (middle) stabilizes training. Adding cutout views (right) yields a simple multi-task formulation that further improves downstream performance. Here, $p_{\cdot}$ denotes a predictor, and colors indicate distinct predictors.
  • Figure 2: Overview of the proposed framework. Each image is augmented into multiple spatial views: global (views A), local (views B), and cutout (views C). A shared encoder, comprising a backbone and a projection head, extracts features for all views. View-specific prediction heads then generate predictions for each view type. All spatial alignment tasks are optimized jointly under a shared alignment objective. Image from ImageNet validation set (№ 43632).
  • Figure 3: Asymmetric and symmetric cutout. Image from ImageNet validation set (№ 7011).
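
A cutout view, in which part of the image is masked before encoding, can be sketched as follows. The rectangular mask, the `frac` and `fill` parameters, and the function name `cutout` are assumptions for illustration. The asymmetric and symmetric variants named in Figure 3 are not defined in this excerpt, so the sketch only shows the masking step itself.

```python
import numpy as np


def cutout(image, frac=0.5, fill=0.0, rng=None):
    """Return a copy of `image` (H, W, C) with one random rectangle masked.

    `frac` is the side length of the masked rectangle relative to the
    image height and width (a hypothetical parameterization).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    # Pick the top-left corner so the rectangle stays inside the image.
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    out = image.copy()  # leave the input untouched
    out[top:top + ch, left:left + cw] = fill
    return out
```

The masked copy can then be passed through the same encoder as the global and local views.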