Self-Organizing Visual Prototypes for Non-Parametric Representation Learning
Thalles Silva, Helio Pedrini, Adín Ramírez Rivera
TL;DR
Self-Organizing Visual Prototypes (SOP) offer a non-parametric alternative to prototypical self-supervised learning: memory-backed anchors and multiple support embeddings (SEs) together describe local regions of the feature space. SEs bootstrap richer, region-specific features and vote within each SOP to produce region-level similarity distributions, while SOP-MIM extends masked image modeling to a non-parametric, patch-level, local representation framework. The training objective combines a global SOP loss with a local SOP-MIM loss, enabling stable learning without extra regularizers. Across linear evaluation, fine-tuning, dense prediction, retrieval, and robustness benchmarks, SOP consistently improves over strong baselines, and the gains grow with model capacity, demonstrating scalability and transferability across vision tasks.
Abstract
We present Self-Organizing Visual Prototypes (SOP), a new training technique for unsupervised visual feature learning. Existing prototypical self-supervised learning (SSL) methods rely on a single prototype to encode all relevant features of a hidden cluster in the data. In the SOP strategy, by contrast, a prototype is represented by many semantically similar representations, or support embeddings (SEs), each containing a complementary set of features that together better characterize their region in space and maximize training performance. We reaffirm the feasibility of non-parametric SSL by introducing novel non-parametric adaptations of two loss functions that implement the SOP strategy. Notably, we introduce the SOP Masked Image Modeling (SOP-MIM) task, where masked representations are reconstructed from the perspective of multiple non-parametric local SEs. We comprehensively evaluate the representations learned using the SOP strategy on a range of benchmarks, including retrieval, linear evaluation, fine-tuning, and object detection. Our pre-trained encoders achieve state-of-the-art performance on many retrieval benchmarks and demonstrate increasing performance gains with more complex encoders.
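To make the idea of SEs voting within an SOP concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that anchors are sampled at random from a memory of past embeddings, that each anchor's SEs are its nearest neighbours in that memory, and that a view's region-level distribution is a softmax over all SOP members whose probabilities are then summed per SOP. The function names, the temperature value, and the cross-entropy pairing of two views are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def build_sops(memory, num_anchors, num_ses, rng):
    """Return memory indices of each SOP: one anchor followed by its SEs.

    Assumption: anchors are drawn uniformly from memory and SEs are the
    anchor's nearest neighbours by cosine similarity.
    """
    anchor_idx = rng.choice(memory.shape[0], size=num_anchors, replace=False)
    sims = memory[anchor_idx] @ memory.T                  # (K, N)
    # Force each anchor to rank first in its own row.
    sims[np.arange(num_anchors), anchor_idx] = np.inf
    return np.argsort(-sims, axis=1)[:, : num_ses + 1]    # (K, num_ses + 1)


def sop_distribution(z, memory, sop_members, temperature=0.1):
    """Region-level distribution: softmax over every member, summed per SOP."""
    members = memory[sop_members]                          # (K, M, D)
    logits = np.einsum("bd,kmd->bkm", z, members) / temperature
    b, k, m = logits.shape
    flat = logits.reshape(b, k * m)
    flat = flat - flat.max(axis=1, keepdims=True)          # numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)
    # Each member casts a vote; votes are pooled within its SOP.
    return probs.reshape(b, k, m).sum(axis=2)              # (B, K)


def cross_entropy(p_target, p_pred, eps=1e-12):
    """Mean cross-entropy between two batches of distributions."""
    return float(-(p_target * np.log(p_pred + eps)).sum(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    memory = l2_normalize(rng.normal(size=(256, 16)))      # stand-in memory bank
    sops = build_sops(memory, num_anchors=8, num_ses=4, rng=rng)
    view_a = l2_normalize(rng.normal(size=(3, 16)))        # stand-in view embeddings
    view_b = l2_normalize(view_a + 0.1 * rng.normal(size=(3, 16)))
    p_a = sop_distribution(view_a, memory, sops)
    p_b = sop_distribution(view_b, memory, sops)
    print("SOP distribution shape:", p_a.shape)
    print("loss:", cross_entropy(p_a, p_b))
```

Pooling the votes after a joint softmax means a view can match an SOP through any of its members, which is one way to read the claim that several SEs characterize a region better than a single prototype. The random vectors above stand in for encoder outputs, and the patch-level SOP-MIM loss is not covered by this sketch.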
