A Review on Discriminative Self-supervised Learning Methods in Computer Vision

Nikolaos Giakoumoglou, Tania Stathaki, Athanasios Gkelias

TL;DR

The paper surveys discriminative self-supervised learning for computer vision, organizing methods into contrastive, clustering, self-distillation, knowledge distillation, and feature decorrelation families. It systematically analyzes architectural choices, pretext tasks, and loss functions, and evaluates methods via linear and semi-supervised benchmarks on ImageNet-1K, plus transfer to a wide range of classification and vision tasks. Key findings highlight strong linear and transfer performance from methods like ReLIC-v2, TWIST, and BYOL-based frameworks, while also underscoring challenges in scalability, robustness, and domain shift. The work emphasizes the need for efficient, broadly applicable SSL techniques, improved benchmarking, and theoretically grounded objectives to guide future research and practical deployment.

Abstract

Self-supervised learning (SSL) has rapidly emerged as a transformative approach in computer vision, enabling the extraction of rich feature representations from vast amounts of unlabeled data and reducing reliance on costly manual annotations. This review presents a comprehensive analysis of discriminative SSL methods, which focus on learning representations by solving pretext tasks that do not require human labels. The paper systematically categorizes discriminative SSL approaches into five main groups: contrastive methods, clustering methods, self-distillation methods, knowledge distillation methods, and feature decorrelation methods. For each category, the review details the underlying principles, architectural components, loss functions, and representative algorithms, highlighting their unique mechanisms and contributions to the field. Extensive comparative evaluations are provided, including linear and semi-supervised protocols on standard benchmarks such as ImageNet, as well as transfer learning performance across diverse downstream tasks. The review also discusses theoretical foundations, scalability, efficiency, and practical challenges, such as computational demands and accessibility. By synthesizing recent advancements and identifying key trends, open challenges, and future research directions, this work serves as a valuable resource for researchers and practitioners aiming to leverage discriminative SSL for robust and generalizable computer vision models.

Paper Structure

This paper contains 60 sections, 7 equations, 21 figures, and 7 tables.

Figures (21)

  • Figure 1: Typical SSL pipeline. Once training on the pretext task is complete, the learned parameters serve as a pre-trained model that is fine-tuned for various downstream computer vision tasks. Image adapted from jing2019selfsupervised.
  • Figure 2: Common architectural variations of Siamese networks used in self-supervised learning, each demonstrating a different gradient flow mechanism. (a) Siamese architecture with gradients (grad) flowing from both branches. (b) Siamese architecture where gradients flow from the first branch only. (c) Siamese architecture with gradient flow from the first branch and a stop-gradient (stop-grad) operation on the second branch, which is updated using an exponential moving average (EMA) of the first branch's weights.
  • Figure 3: Various data augmentation techniques applied to an original image of a yellow Labrador: (a) Original unmodified image; (b) Random crop and resize operation that focuses on a portion of the dog; (c) Similar crop and resize with possible horizontal flip; (d) Color distortion through grayscale conversion; (e) Color distortion through color jittering that alters the image's color properties; (f) Rotation applied at various angles (90°, 180°, 270°); (g) Cutout augmentation that masks a random square region of the image; (h) Gaussian noise injection that adds random pixel-level perturbations; (i) Gaussian blur that reduces image details while preserving structure; and (j) Sobel filtering that emphasizes edges and contours. These diverse transformations help self-supervised models learn invariant representations. Image from chen2020simclr.
  • Figure 4: High level comparison of discriminative SSL methods. (a) Contrastive methods employ instance discrimination. (b) Clustering methods use clustering algorithms to provide supervision. (c) Self-distillation methods use various techniques to avoid collapse while matching representations of Siamese networks. (d) Feature decorrelation methods promote feature diversity. (e) Knowledge distillation methods transfer knowledge from a frozen teacher network to a student network.
  • Figure 5: Overview of contrastive learning frameworks. (a) SimCLR uses simple data augmentations to generate positive pairs. (b) MoCo leverages a dynamic memory bank for negative samples. (c) ReSSL learns from the relationships between instances. (d) MoCLR improves upon SimCLR.
  • ...and 16 more figures
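
The update scheme described in Figure 2(c) can be illustrated with a minimal sketch. It assumes the second (target) branch receives no gradients and simply tracks the first (online) branch through an exponential moving average. The function name `ema_update`, the toy parameter lists, and the momentum value are hypothetical choices made for illustration; they are not taken from the paper.

```python
from typing import List


def ema_update(target: List[float], online: List[float],
               momentum: float = 0.99) -> List[float]:
    """Return momentum * target + (1 - momentum) * online, element-wise."""
    if len(target) != len(online):
        raise ValueError("parameter lists must have the same length")
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target, online)]


# Hypothetical toy parameters for the two branches.
online_params = [1.0, 2.0]
target_params = [0.0, 0.0]

# The online branch would be updated by gradient descent (not shown here).
# The target branch is never updated by gradients (stop-grad); it only
# moves toward the online branch through the EMA step below.
target_params = ema_update(target_params, online_params, momentum=0.9)
```

With a momentum close to 1 the target branch changes slowly, so it tends to provide a more stable regression target than the rapidly changing online branch.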