Contrastive Self-Supervised Learning As Neural Manifold Packing

Guanming Zhang, David J. Heeger, Stefano Martiniani

TL;DR

CLAMP reframes contrastive self-supervised learning as neural-manifold packing, using a physics-inspired short-range repulsive loss to separate augmentation sub-manifolds. By approximating each sub-manifold as an ellipsoid and optimizing a packing energy, CLAMP achieves competitive linear-evaluation performance and strong transfer to object detection, while revealing emergent, class-specific manifolds. The approach bridges non-equilibrium physics and SSL, with Brain-Score analyses showing cortical-like alignment in higher visual areas. Overall, manifold packing provides a principled, interpretable mechanism for structuring high-dimensional representations with practical downstream benefits.

Abstract

Contrastive self-supervised learning based on point-wise comparisons has been widely studied for vision tasks. In the visual cortex of the brain, neuronal responses to distinct stimulus classes are organized into geometric structures known as neural manifolds. Accurate classification of stimuli can be achieved by effectively separating these manifolds, akin to solving a packing problem. We introduce Contrastive Learning As Manifold Packing (CLAMP), a self-supervised framework that recasts representation learning as a manifold packing problem. CLAMP introduces a loss function inspired by the potential energy of short-range repulsive particle systems, such as those encountered in the physics of simple liquids and jammed packings. In this framework, each class consists of sub-manifolds embedding multiple augmented views of a single image. The sizes and positions of the sub-manifolds are dynamically optimized by following the gradient of a packing loss. This approach yields interpretable dynamics in the embedding space that parallel jamming physics, and introduces geometrically meaningful hyperparameters within the loss function. Under the standard linear evaluation protocol, which freezes the backbone and trains only a linear classifier, CLAMP achieves performance competitive with state-of-the-art self-supervised models. Furthermore, our analysis reveals that neural manifolds corresponding to different categories emerge naturally and are effectively separated in the learned representation space, highlighting the potential of CLAMP to bridge insights from physics, neuroscience, and machine learning.
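
To make the idea of a short-range repulsive packing energy concrete, the following is a minimal NumPy sketch rather than the paper's loss. It assumes a harmonic soft-sphere potential (a common choice in the jamming literature) acting between sub-manifold centroids, and it assumes an effective radius equal to $r_s$ times the RMS spread of the views around their centroid. The function name `packing_energy`, the radius definition, and the toy data are all hypothetical.

```python
import numpy as np

def packing_energy(z, r_s=3.0):
    """Harmonic soft-sphere energy between augmentation sub-manifolds.

    z   : array of shape (b, m, d): b images, m augmented views, d dimensions.
    r_s : scale factor for the effective radius (an assumption of this sketch).
    """
    centroids = z.mean(axis=1)                       # (b, d)
    centered = z - centroids[:, None, :]             # (b, m, d)
    # Effective radius: r_s times the RMS distance of the views from the centroid.
    radii = r_s * np.sqrt((centered ** 2).sum(axis=2).mean(axis=1))  # (b,)
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.linalg.norm(diff, axis=2)              # (b, b) centroid distances
    contact = radii[:, None] + radii[None, :] + 1e-12
    # Short-range: pairs farther apart than the contact distance contribute zero.
    overlap = np.clip(1.0 - dist / contact, 0.0, None)
    upper = np.triu_indices(z.shape[0], k=1)         # count each pair once
    return float((overlap[upper] ** 2).sum())

rng = np.random.default_rng(0)
# Two tight, well-separated sub-manifolds versus two broad, overlapping ones.
far = np.stack([rng.normal(0.0, 0.01, (8, 3)),
                rng.normal(0.0, 0.01, (8, 3)) + 10.0])
near = np.stack([rng.normal(0.0, 1.0, (8, 3)),
                 rng.normal(0.0, 1.0, (8, 3))])
energy_far = packing_energy(far)
energy_near = packing_energy(near)
```

In this sketch the energy vanishes once sub-manifolds no longer overlap, so a gradient step only moves or shrinks sub-manifolds that are still in contact. That is the sense in which the interaction is short-range.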

Paper Structure

This paper contains 39 sections, 9 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: CLAMP architecture. The CLAMP framework processes a batch of $b$ input images by applying augmentations to generate $m$ views for each image. These augmented views are then encoded and projected into a shared embedding space. Within this space, the augmented embeddings corresponding to each input form a distinct sub-manifold, resulting in $b$ such sub-manifolds. Then, a pairwise packing loss is applied to minimize overlap between these sub-manifolds. The gradient of the loss is subsequently backpropagated to optimize the model.
  • Figure 2: Sub-manifold and visualization of the embedding space. (a) Schematic for approximating augmentation sub-manifolds as ellipsoids with different scale factors $r_s$. (b) We selected 10 images from the MNIST dataset, one for each digit from 0 to 9, and applied Gaussian noise augmentation. These augmented images were then encoded into a 3-dimensional embedding space for visualization. Solid dots represent the embedding points of each augmented view, while the shaded regions denote circumscribed ellipsoids defined by $(\tilde{z} - Z_i)^\top \Lambda_i^{-1} (\tilde{z} - Z_i) = r_s^2$. Left: the initial embeddings. Right: the trained embeddings. For this toy example, we use $r_s = 3.0$.
  • Figure 3: Training dynamics. (a) Number of neighbours as a function of epoch. (b) Average sub-manifold size in the embedding space as a function of epoch. (c) Distances between pairs of embeddings for untrained and trained networks.
  • Figure 4: The properties of sub-manifolds in the embedding space for the pretrained ResNet-18 network are characterized by: (a) Orientation similarity: the squared cosine similarity between the principal orientations of the sub-manifolds. (b) Centroid distances: the Euclidean distances between the centroids of different sub-manifolds. (c) Centroid similarity: the cosine similarity between the centroid points of sub-manifolds.
  • Figure 5: t-SNE visualization of the representations. The 256-dimensional representation space is visualized with t-SNE. Each color marks the representations of a different category in the CIFAR-10 dataset.
  • ...and 2 more figures
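
The geometric quantities named in the captions of Figures 2 and 4 (centroid $Z_i$, covariance $\Lambda_i$, the ellipsoid test, orientation similarity, centroid distance, and centroid similarity) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, the covariance is normalized by the number of views $m$, the principal orientation is taken to be the leading eigenvector of $\Lambda_i$, and random vectors stand in for trained embeddings.

```python
import numpy as np

def submanifold_stats(z):
    """Centroids Z_i (b, d) and covariances Lambda_i (b, d, d) for z of shape (b, m, d)."""
    centroids = z.mean(axis=1)
    centered = z - centroids[:, None, :]
    covs = np.einsum("bmi,bmj->bij", centered, centered) / z.shape[1]
    return centroids, covs

def inside_ellipsoid(points, centroid, cov, r_s):
    """True where (p - Z)^T Lambda^{-1} (p - Z) <= r_s^2."""
    delta = points - centroid
    q = np.einsum("ni,ij,nj->n", delta, np.linalg.inv(cov), delta)
    return q <= r_s ** 2

def orientation_similarity(covs):
    """Squared cosine similarity between leading eigenvectors of the covariances."""
    _, vecs = np.linalg.eigh(covs)      # eigenvalues in ascending order
    axes = vecs[:, :, -1]               # unit-norm principal axis per sub-manifold
    return (axes @ axes.T) ** 2

def centroid_distances(centroids):
    """Pairwise Euclidean distances between centroids."""
    diff = centroids[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diff, axis=2)

def centroid_similarity(centroids):
    """Pairwise cosine similarity between centroids."""
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.default_rng(0)
b, m, d = 4, 32, 3                      # toy sizes: m > d keeps Lambda_i invertible
z = rng.normal(size=(b, m, d)) + 5.0 * rng.normal(size=(b, 1, d))
Z, Lam = submanifold_stats(z)
orient = orientation_similarity(Lam)
dists = centroid_distances(Z)
cosines = centroid_similarity(Z)
center_inside = inside_ellipsoid(Z[0][None, :], Z[0], Lam[0], r_s=3.0)
far_inside = inside_ellipsoid(Z[0][None, :] + 100.0, Z[0], Lam[0], r_s=3.0)
```

Squaring the cosine in `orientation_similarity` removes the sign ambiguity of eigenvectors, so two sub-manifolds elongated along the same line score 1 regardless of which direction the eigensolver returns.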