Alligat0R: Pre-Training Through Co-Visibility Segmentation for Relative Camera Pose Regression

Thibaut Loiseau, Guillaume Bourmaud, Vincent Lepetit

TL;DR

Alligat0R replaces CroCo's cross-view reconstruction with a covisibility segmentation pretraining objective, enabling robust learning in both covisible and non-covisible regions. The authors introduce Cub3, a 5M-pair dataset with dense covisibility annotations from nuScenes and ScanNet, and demonstrate state-of-the-art performance on metric relative pose regression, especially under challenging, low-overlap conditions. The approach yields interpretable covisibility maps and shows robust generalization to out-of-domain dense matching tasks, with competitive results against CroCo v2. Together, these contributions advance pretraining for binocular vision by aligning learning objectives with actual geometric reasoning required for pose estimation, and by providing a large-scale dataset to support further research.

Abstract

Pre-training techniques have greatly advanced computer vision, with CroCo's cross-view completion approach yielding impressive results in tasks like 3D reconstruction and pose regression. However, cross-view completion is ill-posed in non-covisible regions, limiting its effectiveness. We introduce Alligat0R, a novel pre-training approach that replaces cross-view learning with a covisibility segmentation task. Our method predicts whether each pixel in one image is covisible in the second image, occluded, or outside the field of view, which makes the pre-training effective in both covisible and non-covisible regions and yields interpretable predictions. To support this, we present Cub3, a large-scale dataset with 5M image pairs and dense covisibility annotations derived from the nuScenes and ScanNet datasets. Cub3 includes diverse scenarios with varying degrees of overlap. Experiments show that Alligat0R significantly outperforms CroCo in relative pose regression. Code is available at https://github.com/thibautloiseau/alligat0r.
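The three-way pixel labeling described in the abstract can be illustrated with a small sketch. This is not the authors' annotation pipeline, which is not described in this section. It assumes a dense, valid depth map for each view, a single pinhole intrinsics matrix shared by both views, and a known rigid transform between the cameras. The function name `covisibility_labels`, the class indices, and the relative depth tolerance `rel_tol` are hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical class indices for the three covisibility classes.
COVISIBLE, OCCLUDED, OUTSIDE_FOV = 0, 1, 2


def covisibility_labels(depth1, depth2, K, T_2_from_1, rel_tol=0.05):
    """Label every pixel of view 1 with respect to view 2.

    depth1, depth2: (H, W) z-depth maps, assumed valid everywhere.
    K: (3, 3) pinhole intrinsics, assumed identical for both views.
    T_2_from_1: (4, 4) rigid transform from camera-1 to camera-2 coordinates.
    rel_tol: relative depth slack before a point counts as occluded (assumed).
    """
    H, W = depth1.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)

    # Unproject view-1 pixels to 3D points in camera-1 coordinates.
    pts1 = (np.linalg.inv(K) @ pix.T) * depth1.reshape(1, -1)
    # Express the same points in camera-2 coordinates.
    pts2 = T_2_from_1[:3, :3] @ pts1 + T_2_from_1[:3, 3:4]

    z2 = pts2[2]
    in_front = z2 > 1e-6
    safe_z = np.where(in_front, z2, 1.0)  # avoid dividing by zero or negatives
    proj = K @ pts2
    u2 = np.rint(proj[0] / safe_z).astype(int)
    v2 = np.rint(proj[1] / safe_z).astype(int)

    # Points behind camera 2 or projecting outside its image are outside FOV.
    in_fov = in_front & (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)
    labels = np.full(H * W, OUTSIDE_FOV, dtype=np.uint8)

    # Inside the FOV: occluded if view 2 sees a closer surface at that pixel.
    seen = depth2[v2[in_fov], u2[in_fov]]
    occluded = z2[in_fov] > seen * (1.0 + rel_tol)
    labels[in_fov] = np.where(occluded, OCCLUDED, COVISIBLE)
    return labels.reshape(H, W)
```

A map produced this way has the same shape as the image and one of three class indices per pixel, which is the kind of target a per-pixel segmentation loss would consume. With real sensor depth, the tolerance and the handling of missing depth values would need care, which is one plausible source of the annotation noise mentioned for nuScenes.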

Paper Structure

This paper contains 31 sections, 7 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: We introduce Alligat0R, a novel pretraining method for binocular vision. Alligat0R explicitly segments pixels as covisible, occluded, or outside field-of-view, overcoming the fundamental limitation of CroCo (Weinzaepfel et al., 2022, 2023), which attempts to reconstruct potentially non-covisible regions.
  • Figure 2: Overview of Alligat0R. (a) During pre-training, we use the same architecture as CroCo but replace the reconstruction objective with a covisibility segmentation task, without masking, where each pixel in one view is classified as covisible, occluded, or outside FOV with respect to the other view. (b) For fine-tuning on the relative pose regression task, we pool features from both views, process them through a shared MLP, and use separate heads for predicting translation and rotation.
  • Figure 3: Covisibility annotation examples from Cub3 for nuScenes (top) and ScanNet (bottom). For each image pair, we show the corresponding covisibility maps with color-coding for covisible, occluded, and outside FOV regions. Note how our annotation process handles varying degrees of overlap and challenging viewpoint changes. Some annotations, particularly the distinction between covisible and occluded pixels, may be noisy, especially for nuScenes; the experiments show that Alligat0R is highly robust to this noise.
  • Figure 4: Distributions of overlap, scale ratio, and viewpoint angle in Cub3-all and Cub3-50 for nuScenes (left) and ScanNet (right).
  • Figure 5: Performance of CroCo and Alligat0R across different geometric challenges on RUBIK. Results show accuracy at the $5^\circ$/2m threshold for models trained on different datasets (Cub3-50 or Cub3-all) and for frozen backbones. Alligat0R trained on Cub3-all consistently outperforms other configurations, particularly for challenging cases with low overlap, large scale differences, and extreme viewpoint changes.
  • ...and 7 more figures
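
The "accuracy at the $5^\circ$/2m threshold" reported in Figure 5 can be made concrete with a short sketch. The exact RUBIK evaluation protocol is not given in this section, so the code below rests on assumptions: rotation error is taken as the geodesic angle between predicted and ground-truth rotation matrices, translation error as the Euclidean distance in meters, and a pair counts as correct only when both errors are within their thresholds. All function names are hypothetical.

```python
import numpy as np


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def translation_error_m(t_pred, t_gt):
    """Euclidean distance between metric translation vectors."""
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)))


def accuracy_at(preds, gts, max_deg=5.0, max_m=2.0):
    """Fraction of pairs with both errors within the thresholds (assumed rule).

    preds, gts: equal-length sequences of (R, t) tuples.
    """
    hits = [
        rotation_error_deg(Rp, Rg) <= max_deg
        and translation_error_m(tp, tg) <= max_m
        for (Rp, tp), (Rg, tg) in zip(preds, gts)
    ]
    return sum(hits) / len(hits)
```

Because the translation error is measured in meters rather than as an angle between directions, this metric only makes sense for metric relative pose regression, which is the setting the TL;DR describes.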