Self-Organizing Visual Prototypes for Non-Parametric Representation Learning

Thalles Silva, Helio Pedrini, Adín Ramírez Rivera

TL;DR

Self-Organizing Prototypes (SOP) introduce a non-parametric alternative to prototypical self-supervised learning by using memory-backed anchors and multiple support embeddings (SEs) to describe local regions of the feature space. SEs bootstrap richer, region-specific features and vote within SOPs to produce region-level similarity distributions, while SOP-MIM extends masked image modeling to a non-parametric, patch-level, local representation framework. The training objective combines a global SOP loss with a local SOP-MIM loss, enabling stable learning without extra regularizers. Across linear evaluation, fine-tuning, dense prediction, retrieval, and robustness benchmarks, SOP consistently improves over strong baselines, with larger gains observed as model capacity grows, demonstrating scalability and transferability across vision tasks.

Abstract

We present Self-Organizing Visual Prototypes (SOP), a new training technique for unsupervised visual feature learning. Unlike existing prototypical self-supervised learning (SSL) methods that rely on a single prototype to encode all relevant features of a hidden cluster in the data, we propose the SOP strategy. In this strategy, a prototype is represented by many semantically similar representations, or support embeddings (SEs), each containing a complementary set of features that together better characterize their region in space and maximize training performance. We reaffirm the feasibility of non-parametric SSL by introducing novel non-parametric adaptations of two loss functions that implement the SOP strategy. Notably, we introduce the SOP Masked Image Modeling (SOP-MIM) task, where masked representations are reconstructed from the perspective of multiple non-parametric local SEs. We comprehensively evaluate the representations learned using the SOP strategy on a range of benchmarks, including retrieval, linear evaluation, fine-tuning, and object detection. Our pre-trained encoders achieve state-of-the-art performance on many retrieval benchmarks and demonstrate increasing performance gains with more complex encoders.

Paper Structure

This paper contains 38 sections, 4 equations, 5 figures, 20 tables.

Figures (5)

  • Figure 1: $k$-NN top-1 accuracy on ImageNet.
  • Figure 2: First, we select a set of random anchors ${\bm{A}} = \left\{ {\bm{a}}_i \right\}_{i=0}^{K}$ (colored squares with patterns) from a set of representations kept in memory (gray sphere). Second, each anchor selects its $k$ nearest neighbors as support embeddings (SEs) (colored diamonds; $k = 2$ in this illustration). Each anchor ${\bm{a}}_i$ and its SEs form an SOP, representing a hidden structure within the data (shaded colored region). Note that a given embedding may belong to more than one SOP simultaneously. Put together, SOPs can be linearly arranged as a dataset ${\bm{D}} \in \mathbb{R}^{K(k+1) \times d}$ with labels ${\bm{Y}} \in \mathbb{R}^{K(k+1) \times K}$ representing the interconnections between SEs and anchors. Intuitively, the SEs in each SOP estimate the degree of similarity between a view and that SOP. The SEs then combine their votes to produce a final score for each view, resulting in similarity distributions optimized to be consistent across SOPs.
  • Figure C.1: Blockwise vs. Random masking.
  • Figure C.2: t-SNE visualizations on CIFAR-10 with SSL pre-trained ViT-Base feature extractors: SOP (left) vs. iBOT (right).
  • Figure C.3: t-SNE visualization on CIFAR-100 using SSL pre-trained ViT-Base encoders as feature extractors. Qualitative results for SOP (left) and iBOT (right).
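
The construction described in the Figure 2 caption can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the helper names (`build_sops`, `sop_distribution`), the use of cosine similarity for the nearest-neighbor search, the one-hot form of ${\bm{Y}}$, the softmax-then-sum voting rule, and the temperature value are all assumptions made here for illustration. The memory is filled with random unit vectors in place of real encoder outputs.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def build_sops(memory, num_anchors, k, rng):
    """Pick random anchors from memory and attach each anchor's k nearest neighbors.

    Returns D (K*(k+1) rows of dimension d) and Y (K*(k+1) rows of length K).
    Assumption: Y is one-hot, marking which SOP each row of D belongs to.
    """
    anchor_ids = rng.sample(range(len(memory)), num_anchors)
    D, Y = [], []
    for sop_id, a in enumerate(anchor_ids):
        # Rank every other memory entry by cosine similarity to the anchor
        # (memory rows are unit-norm, so the dot product is the cosine).
        others = [j for j in range(len(memory)) if j != a]
        others.sort(key=lambda j: dot(memory[a], memory[j]), reverse=True)
        for j in [a] + others[:k]:  # anchor first, then its k SEs
            # An embedding shared by several SOPs simply appears in several rows.
            D.append(memory[j])
            Y.append([1.0 if c == sop_id else 0.0 for c in range(num_anchors)])
    return D, Y


def sop_distribution(view, D, Y, temperature=0.1):
    """Assumed voting rule: softmax over view-to-member similarities,
    then sum the resulting weights per SOP using the labels Y."""
    logits = [dot(view, row) / temperature for row in D]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    num_sops = len(Y[0])
    return [sum(w * y[c] for w, y in zip(weights, Y)) / total
            for c in range(num_sops)]


rng = random.Random(0)
d, K, k = 8, 4, 2  # embedding size, number of anchors, SEs per anchor
memory = [normalize([rng.gauss(0, 1) for _ in range(d)]) for _ in range(64)]
D, Y = build_sops(memory, K, k, rng)
view = normalize([rng.gauss(0, 1) for _ in range(d)])
probs = sop_distribution(view, D, Y)
print(len(D), len(Y[0]), probs)
```

With `K = 4` anchors and `k = 2` SEs each, `D` has `K * (k + 1) = 12` rows, matching the shape given in the caption, and `probs` is a distribution over the 4 SOPs. In training, such distributions would be computed for different views of the same image and encouraged to agree; that step is omitted here.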