Learning from Memory: Non-Parametric Memory Augmented Self-Supervised Learning of Visual Features

Thalles Silva, Helio Pedrini, Adín Ramírez Rivera

TL;DR

MaSSL addresses SSL instability by introducing a non-parametric memory that stores recent image representations and a stochastic memory-block strategy to regularize training. By comparing current views to past concepts through view–memory similarity distributions and enforcing consistency via cross-entropy across multiple memory blocks, MaSSL achieves stable, transferable visual features without extra regularizers. Key results show strong performance across transfer, retrieval, and low-shot tasks, with improved efficiency due to CLS-only training and avoidance of learned prototypes. The approach offers a practical, scalable alternative to clustering-based SSL, with clear benefits in robustness and resource usage.

Abstract

This paper introduces a novel approach to improving the training stability of self-supervised learning (SSL) methods by leveraging a non-parametric memory of seen concepts. The proposed method augments a neural network with a memory component to stochastically compare current image views with previously encountered concepts. Additionally, we introduce stochastic memory blocks to regularize training and enforce consistency between image views. We benchmark our method extensively on vision tasks, including linear probing, transfer learning, low-shot classification, and image retrieval, across many datasets. The experimental results confirm that the proposed approach achieves stable SSL training without additional regularizers, learns highly transferable representations, and requires less computing time and fewer resources.

Paper Structure (30 sections, 2 equations, 6 figures, 14 tables)

Figures (6)

  • Figure 1: Learning from memory. Given two or more views of an image, each view is encoded by the student and teacher encoders, resulting in respective vector representations $z^1$ and $z^2$. Each view's representation is compared against representations of previously seen images in memory, resulting in respective similarity distributions. Note that the working memory $\mathcal{M}$ is split into blocks, $M_i$, of randomly chosen representations. The learning objective, $\mathcal{L}$, forces the similarity distributions of views w.r.t. the memory representations to be consistent. In a case where the model perceives an image of a dog, the interaction between what it currently sees and what it remembers should produce (1) strong similarity scores for previously seen dogs, (2) weak scores for non-related images in the memory, and (3) interactions should be consistent among views.
  • Figure 2: Visualization of MaSSL's self-attention maps. Multiple heads are displayed in different colors.
  • Figure 3: Sparse correspondence results for MaSSL.
  • Figure C.1: Visualizing self-attention maps. From top to bottom, in each triplet of rows, we report qualitative evaluations for MaSSL, iBOT, and DINO. The columns show multiple attention heads of the last layer.
  • Figure C.2: Visualization for sparse correspondence. We assess the ability to match local embeddings using pairs of views from the same image. From top to bottom, in each triplet of rows, we report qualitative evaluations for MaSSL, iBOT, and DINO.
  • ...and 1 more figure
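
The mechanism described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the two temperature values, the equal-size random partition of the memory, and the use of the teacher distribution as the cross-entropy target are all assumptions made for the example. The source only states that each view is compared against randomly chosen memory blocks and that the resulting similarity distributions are made consistent via cross-entropy.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def softmax(scores, temperature):
    """Numerically stable softmax over a list of similarity scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def split_into_blocks(memory, num_blocks, rng):
    """Randomly partition the memory into equal-size blocks (assumed scheme)."""
    indices = list(range(len(memory)))
    rng.shuffle(indices)
    size = len(indices) // num_blocks
    return [[memory[j] for j in indices[b * size:(b + 1) * size]]
            for b in range(num_blocks)]


def view_memory_distribution(z, block, temperature):
    """Similarity distribution of one view representation over one memory block."""
    z = l2_normalize(z)
    sims = [sum(a * b for a, b in zip(z, l2_normalize(m))) for m in block]
    return softmax(sims, temperature)


def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy H(target, pred) between two discrete distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def memory_consistency_loss(z_student, z_teacher, memory, num_blocks=2,
                            t_student=0.1, t_teacher=0.04, seed=0):
    """Average cross-entropy between teacher and student view-memory
    distributions over all memory blocks. Temperatures are assumed values."""
    rng = random.Random(seed)
    blocks = split_into_blocks(memory, num_blocks, rng)
    losses = []
    for block in blocks:
        p_teacher = view_memory_distribution(z_teacher, block, t_teacher)
        p_student = view_memory_distribution(z_student, block, t_student)
        losses.append(cross_entropy(p_teacher, p_student))
    return sum(losses) / len(losses)


# Toy data: 16 stored representations of dimension 8 (random stand-ins).
rng = random.Random(42)
memory = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
z1 = [rng.gauss(0, 1) for _ in range(8)]   # student view representation
z_opposite = [-x for x in z1]              # a deliberately inconsistent view

loss_same = memory_consistency_loss(z1, z1, memory)
loss_opposite = memory_consistency_loss(z1, z_opposite, memory)
print(loss_same, loss_opposite)
```

In this toy setup the loss is lower when both views agree on which memory entries they resemble than when the second view points in the opposite direction, which is the consistency behavior the caption describes. In actual training the representations would come from the student and teacher encoders and the loss would be backpropagated through the student only; that part is outside this sketch.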