Stable Single-Pixel Contrastive Learning for Semantic and Geometric Tasks
Leonid Pogorelyuk, Niels Bracher, Aaron Verkleeren, Lars Kühmichel, Stefan T. Radev
TL;DR
The authors address the challenge of learning pixel-level representations that are both semantic and geometric by introducing a family of stable contrastive losses that need no momentum-based teacher. They train an overcomplete feature map to be view-invariant under 2D and 3D transformations, combining within-image and between-image terms into a single objective. The resulting dense per-pixel descriptors support precise point correspondences and encode distinct semantic and geometric modes, as demonstrated in synthetic 2D and 3D experiments. The method improves pixel-level localization and cross-view matching without teacher-student training, with potential benefits for dense correspondence and 3D understanding in downstream tasks.
Abstract
We pilot a family of stable contrastive losses for learning pixel-level representations that jointly capture semantic and geometric information. Our approach maps each pixel of an image to an overcomplete descriptor that is both view-invariant and semantically meaningful. It enables precise point correspondence across images without requiring momentum-based teacher-student training. Two experiments in synthetic 2D and 3D environments demonstrate the properties of our loss and the resulting overcomplete representations.
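To make the point-correspondence claim concrete, the following is a minimal sketch of how dense per-pixel descriptors could be used to match a pixel across two views. It is not the authors' implementation: the function name `match_pixel`, the `(H, W, D)` array layout, and the use of cosine similarity with a nearest-neighbor lookup are all assumptions made for illustration, and random arrays stand in for the output of a trained feature map.

```python
import numpy as np

def match_pixel(desc_a, desc_b, query_yx):
    """Find the pixel in image B whose descriptor best matches a query pixel in image A.

    desc_a, desc_b: hypothetical dense descriptor maps of shape (H, W, D).
    query_yx: (row, col) of the query pixel in image A.
    Returns the (row, col) in image B with the highest cosine similarity.
    """
    h, w, d = desc_b.shape
    # Unit-normalize the query descriptor.
    q = desc_a[query_yx[0], query_yx[1]]
    q = q / (np.linalg.norm(q) + 1e-8)
    # Flatten image B to (H*W, D) and unit-normalize every descriptor.
    flat = desc_b.reshape(-1, d)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    # Cosine similarity of the query against every pixel of image B.
    sims = flat @ q
    best = int(np.argmax(sims))
    # Convert the flat index back to (row, col).
    return divmod(best, w)

# Toy check: image B is image A shifted 3 columns to the right, so a
# view-invariant descriptor at A[2, 5] should reappear at B[2, 8].
rng = np.random.default_rng(0)
desc_a = rng.normal(size=(8, 12, 16))       # stand-in for learned descriptors
desc_b = np.roll(desc_a, shift=3, axis=1)   # stand-in for a second view
print(match_pixel(desc_a, desc_b, (2, 5)))  # (2, 8)
```

In this toy setup the descriptors are view-invariant by construction, so the lookup recovers the shift exactly. With learned descriptors, the quality of the match depends on how well the training loss enforces that invariance.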
