Gaussian-Constrained LeJEPA Representations for Unsupervised Scene Discovery and Pose Consistency
Mohsen Mostafa
TL;DR
This paper addresses unsupervised 3D scene discovery and camera pose consistency from unstructured image collections under challenging real-world conditions. The author proposes Gaussian-constrained embeddings inspired by LeJEPA, culminating in a LeJEPA-Enhanced pipeline that enforces isotropic Gaussian priors on image embeddings and uses Gaussian-based similarity measures. Across three progressively more robust pipelines, the experiments indicate that the Gaussian constraints improve scene separation and pose plausibility compared with heuristic baselines, which may reflect better generalization. While it does not introduce new theory, the work provides a practical bridge between theoretical self-supervised learning priors and real-world structure-from-motion tasks, and it points to a promising direction for integrating principled representation constraints into practical 3D reconstruction systems.
Abstract
Unsupervised 3D scene reconstruction from unstructured image collections remains a fundamental challenge in computer vision, particularly when images originate from multiple unrelated scenes and contain significant visual ambiguity. The Image Matching Challenge 2025 (IMC2025) highlights these difficulties by requiring both scene discovery and camera pose estimation under real-world conditions, including outliers and mixed content. This paper investigates the application of Gaussian-constrained representations inspired by LeJEPA (a Joint Embedding Predictive Architecture variant) to address these challenges. We present three progressively refined pipelines, culminating in a LeJEPA-inspired approach that enforces isotropic Gaussian constraints on learned image embeddings. Rather than introducing new theoretical guarantees, our work empirically evaluates how these constraints influence clustering consistency and pose estimation robustness in practice. Experimental results on IMC2025 demonstrate that Gaussian-constrained embeddings can improve scene separation and pose plausibility compared to heuristic-driven baselines, particularly in visually ambiguous settings. These findings suggest that theoretically motivated representation constraints offer a promising direction for bridging self-supervised learning principles and practical structure-from-motion pipelines.
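To make the idea of an isotropic Gaussian constraint on embeddings concrete, the following is a minimal sketch, not the regularizer used in LeJEPA or in this paper. It assumes a simple moment-matching penalty (embedding mean pulled toward zero, covariance pulled toward the identity) and a standard Gaussian (RBF) kernel as one possible "Gaussian-based similarity". The function names `isotropic_gaussian_penalty` and `gaussian_similarity`, the bandwidth `sigma`, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def isotropic_gaussian_penalty(z):
    """Moment-matching stand-in for an isotropic Gaussian prior (assumption).

    Returns ||mean||^2 + ||cov - I||_F^2, which is zero only when the
    embeddings have zero mean and identity covariance.
    """
    z = np.asarray(z, dtype=float)
    n, d = z.shape
    mean = z.mean(axis=0)
    centered = z - mean
    cov = centered.T @ centered / (n - 1)  # unbiased sample covariance
    mean_term = float(np.sum(mean ** 2))
    cov_term = float(np.sum((cov - np.eye(d)) ** 2))
    return mean_term + cov_term


def gaussian_similarity(z, sigma=1.0):
    """Pairwise RBF affinity exp(-||z_i - z_j||^2 / (2 * sigma^2))."""
    z = np.asarray(z, dtype=float)
    sq_norms = np.sum(z ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (z @ z.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


# Synthetic embeddings (hypothetical data, not IMC2025 features).
rng = np.random.default_rng(0)
iso = rng.standard_normal((2000, 8))            # close to N(0, I)
skewed = iso * np.linspace(0.2, 3.0, 8) + 1.5   # anisotropic and shifted

penalty_iso = isotropic_gaussian_penalty(iso)
penalty_skewed = isotropic_gaussian_penalty(skewed)
affinity = gaussian_similarity(iso[:5])
```

In this sketch the penalty is small for the near-isotropic sample and much larger for the shifted, anisotropic one, so adding it to a training loss would push embeddings toward the Gaussian prior. The affinity matrix is the kind of input a spectral or graph-based clustering step could consume for scene grouping, though the paper's actual clustering procedure is not specified in this section.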
