Stable Single-Pixel Contrastive Learning for Semantic and Geometric Tasks

Leonid Pogorelyuk, Niels Bracher, Aaron Verkleeren, Lars Kühmichel, Stefan T. Radev

TL;DR

The authors address the challenge of learning pixel-level representations that are simultaneously semantic and geometric by introducing a stable, momentum-free family of contrastive losses. They train an overcomplete feature map to be view-invariant across 2D and 3D transformations, combining within-image and between-image terms into a single objective. The approach yields dense per-pixel descriptors capable of precise point correspondences and even encodes distinct semantic and geometric modes, demonstrated in synthetic 2D and 3D experiments. This method improves pixel-level localization and cross-view matching without teacher-student training, with potential benefits for dense correspondence and 3D understanding in downstream tasks.

Abstract

We pilot a family of stable contrastive losses for learning pixel-level representations that jointly capture semantic and geometric information. Our approach maps each pixel of an image to an overcomplete descriptor that is both view-invariant and semantically meaningful. It enables precise point-correspondence across images without requiring momentum-based teacher-student training. Two experiments in synthetic 2D and 3D environments demonstrate the properties of our loss and the resulting overcomplete representations.
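The exact form of the loss is not given in this section, so the following is only a minimal sketch under stated assumptions. It uses a generic margin-based (hinge) contrastive term on an $\ell_p$ distance between per-pixel descriptors as a stand-in for the authors' loss, and a convex combination $\lambda\,\mathcal{L}_{\text{within}} + (1-\lambda)\,\mathcal{L}_{\text{between}}$, which is inferred from the Figure 2 caption ($\lambda=1$ means $\mathcal{L}_{\text{within}}$ only). The function names, the `margin` parameter, and the hinge form are all hypothetical.

```python
import numpy as np

def lp_distance(a, b, p):
    """l_p distance along the channel (last) axis of two descriptor arrays."""
    diff = np.abs(a - b)
    if np.isinf(p):
        return diff.max(axis=-1)  # l_inf: largest per-channel difference
    return (diff ** p).sum(axis=-1) ** (1.0 / p)

def hinge_contrastive(anchor, positive, negative, p=np.inf, margin=1.0):
    """Illustrative stand-in, not the paper's loss.

    Pulls matching descriptors together and pushes non-matching ones
    at least `margin` apart in l_p distance.
    """
    pull = lp_distance(anchor, positive, p)
    push = np.maximum(0.0, margin - lp_distance(anchor, negative, p))
    return float((pull + push).mean())

def combined_loss(l_within, l_between, lam):
    """Assumed weighting: lam = 1 keeps only the within-image term."""
    return lam * l_within + (1.0 - lam) * l_between
```

Under these assumptions, an $\ell_\infty$ distance with unit margin is already satisfied when a non-matching pair differs by 1 in a single channel, which is consistent with the "differ by at least one channel" behavior described in the Figure 1 caption.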

Paper Structure

This paper contains 26 sections, 5 equations, 9 figures.

Figures (9)

  • Figure 1: Matching pixels between different views (top left) of synthetic scenes from Experiment 2, built from ShapeNetCore objects [chang2015shapenetinformationrich3dmodel] placed in SceneNN rooms [7785081]. Our loss (with the $\ell_\infty$ norm) encourages non-matching features to differ by at least one channel, producing a network that encodes separate objects (instance segmentation; top right channels) and also identifies edges and interiors (bottom right), without explicitly being trained to do so.
  • Figure 2: Experiment 1. Top Left: ID and OOD accuracy vs. $\lambda$ (relative weight) and $p$ (norm). Wide violins show seed variability: runs either encode color ($\approx$85% ID, $\approx$10% OOD) or suppress it (balanced $\approx$75%). Lower $\lambda$ (more weight on $\mathcal{L}_{\text{between}}$) increases the probability of encoding color; at $\lambda=1$ ($\mathcal{L}_{\text{within}}$ only), features become invariant to color. Top Right: Feature maps from frozen backbones at $\lambda=0.5$ for two example inputs: $\ell_\infty$ yields sharp, bit-like encodings; $\ell_1$ and $\ell_2$ yield smoother encodings. Downstream classification on ColoredMNIST revealed that all three backbones encode color; this is clearly visible in the feature maps for $\ell_\infty$ and $\ell_1$ but not for $\ell_2$. Bottom: Loss curves for each norm, averaged over the 10 seeds per $\lambda$. For $\ell_\infty$ and $\ell_1$, backbone training is stable and robust across different $\lambda$, in contrast to the traditional $\ell_2$ norm. A higher weight on the between-image contrastive term (lower $\lambda$) results in a higher overall loss.
  • Figure 3: Top row: source image, its transformed view (target 1), and a different image (target 2). Bottom row: Each source pixel (colored for visualization only) is mapped onto the location of the most similar (in $\ell_2$ distance) encoding.
  • Figure 4: Validation loss curve with $\ell_{\infty}$ and $\lambda =1$. The loss demonstrates stable convergence with minimal variance.
  • Figure 5: Example overcomplete representations obtained via an $\ell_\infty$-based loss from validation samples.
  • ...and 4 more figures
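
The matching described in the Figure 3 caption, where each source pixel is mapped to the target location whose encoding is closest in $\ell_2$ distance, can be sketched as a brute-force nearest-neighbor search over dense feature maps. This is a minimal illustration rather than the authors' code; the function name `match_pixels` and the `(H, W, C)` array layout are assumptions.

```python
import numpy as np

def match_pixels(src_feat, tgt_feat):
    """Map every source pixel to its nearest target pixel in l2 feature distance.

    src_feat: (H, W, C) descriptors of the source image.
    tgt_feat: (H2, W2, C) descriptors of the target image.
    Returns an (H, W, 2) integer array of (row, col) target locations.
    """
    h, w, c = src_feat.shape
    h2, w2, _ = tgt_feat.shape
    src = src_feat.reshape(-1, c)
    tgt = tgt_feat.reshape(-1, c)
    # Squared l2 distances via ||a||^2 - 2 a.b + ||b||^2, shape (H*W, H2*W2).
    d2 = (src ** 2).sum(1)[:, None] - 2.0 * src @ tgt.T + (tgt ** 2).sum(1)[None, :]
    nearest = d2.argmin(axis=1)  # flat index of the closest target pixel
    rows, cols = np.unravel_index(nearest, (h2, w2))
    return np.stack([rows, cols], axis=-1).reshape(h, w, 2)
```

The full distance matrix has `H*W*H2*W2` entries, so this form is only practical for small feature maps; larger images would need chunking or an approximate nearest-neighbor index.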