SCENES: Subpixel Correspondence Estimation With Epipolar Supervision

Dominik A. Kloepfer, João F. Henriques, Dylan Campbell

TL;DR

SCENES addresses the challenge of learning high-quality subpixel image correspondences without requiring ground-truth point matches or 3D structure. It replaces standard correspondence losses with epipolar losses that encourage matches to lie on epipolar lines derived from relative camera poses. This enables effective finetuning with pose supervision alone, and a bootstrapping strategy that uses pose estimates removes the need for ground-truth pose data. The method demonstrates strong improvements on challenging indoor and outdoor datasets, outperforming several state-of-the-art detectors and matchers under weak or no supervision. This work broadens the applicability of learned local matching in new domains and reduces the annotation burden for adapting perception systems to novel environments.

Abstract

Extracting point correspondences from two or more views of a scene is a fundamental computer vision problem with particular importance for relative camera pose estimation and structure-from-motion. Existing local feature matching approaches, trained with correspondence supervision on large-scale datasets, obtain highly accurate matches on the test sets. However, they do not generalise well to new datasets with different characteristics to those they were trained on, unlike classic feature extractors. Instead, they require finetuning, which assumes that ground-truth correspondences or ground-truth camera poses and 3D structure are available. We relax this assumption by removing the requirement of 3D structure, e.g., depth maps or point clouds, and only require camera pose information, which can be obtained from odometry. We do so by replacing correspondence losses with epipolar losses, which encourage putative matches to lie on the associated epipolar line. While weaker than correspondence supervision, we observe that this cue is sufficient for finetuning existing models on new data. We then further relax the assumption of known camera poses by using pose estimates in a novel bootstrapping approach. We evaluate on highly challenging datasets, including an indoor drone dataset and an outdoor smartphone camera dataset, and obtain state-of-the-art results without strong supervision.
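
To make the epipolar constraint described in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the relative pose is given as a rotation R and translation t with the convention X2 = R X1 + t, and that points are expressed in normalised image coordinates (pixel coordinates multiplied by the inverse camera intrinsics). The function names are hypothetical. The sketch builds the essential matrix, maps each point in the first image to its epipolar line in the second, and measures the perpendicular distance of a putative match to that line, which is the kind of quantity an epipolar loss would drive towards zero.

```python
import numpy as np


def skew(t):
    """Return the 3x3 cross-product matrix [t]_x, so skew(t) @ v == np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])


def essential_from_pose(R, t):
    """Essential matrix E = [t]_x R (assumed convention: X2 = R @ X1 + t)."""
    return skew(t) @ R


def epipolar_distance(E, x1, x2):
    """Perpendicular distance from each x2 to the epipolar line of x1 in image 2.

    x1, x2: (N, 2) arrays of normalised image coordinates (an assumption).
    Returns an (N,) array of non-negative distances.
    """
    x1h = np.hstack([x1, np.ones((len(x1), 1))])  # homogeneous points, image 1
    x2h = np.hstack([x2, np.ones((len(x2), 1))])  # homogeneous points, image 2
    lines = x1h @ E.T                             # row i is the line E @ x1h[i]
    numerator = np.abs(np.sum(lines * x2h, axis=1))
    denominator = np.linalg.norm(lines[:, :2], axis=1)
    return numerator / denominator


# Usage: identity rotation and a sideways translation (illustrative values).
E = essential_from_pose(np.eye(3), np.array([1.0, 0.0, 0.0]))
x1 = np.array([[0.1, 0.2], [0.1, 0.2]])
x2 = np.array([[0.5, 0.2], [0.5, 0.25]])
distances = epipolar_distance(E, x1, x2)
```

With an identity rotation and a purely sideways translation the epipolar lines are horizontal, so the distance reduces to the vertical offset between the two points. The first match above lies on its line, while the second is off by 0.05.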

Paper Structure

This paper contains 43 sections, 6 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: SCENES (Subpixel Correspondence EstimatioN with Epipolar Supervision) learns to find high-quality local image matches without requiring correspondence supervision. Instead, pose supervision alone is used, encouraging matches to lie on epipolar lines. The red pixel in (a) corresponds to the red epipolar line in (b). The network initially matches the red pixel to the blue pixel, but our epipolar losses favour matches on the epipolar line (not necessarily the closest point). The correspondences found by the state-of-the-art MatchFormer algorithm [wang2022matchformer] are displayed before (c) and after (d) finetuning with epipolar supervision. The colour denotes whether the squared symmetrical epipolar distance is below (green) or above (red) a threshold of $5\cdot10^{-4}$. Images are from the challenging EuRoC-MAV drone dataset [burri2016eurocmav].
  • Figure 2: Visualisation of the epipolar classification (coarse) and regression (fine) losses. The epipolar classification loss assigns the 'ground-truth' match location to the highest probability point on the epipolar line and compares the resulting binary mask ${\mathtt{M}}^\textrm{epi}$ to the predicted confidence map ${\mathtt{C}}$ via a cross-entropy loss or similar. The epipolar regression loss computes the perpendicular distance between the predicted match $\hat{{\mathbf{x}}}_2$ and the epipolar line ${\mathbf{l}}_{12}$. Note that the thin epipolar line is plotted in (a) and (b) for visualisation purposes only; it is not part of the confidence map ${\mathtt{C}}$ or binary mask ${\mathtt{M}}$.
  • Figure 3: Qualitative matching results on the EuRoC-MAV drone dataset [burri2016eurocmav]. The correspondences found by the state-of-the-art MatchFormer algorithm [wang2022matchformer] are displayed before (a, c) and after (b, d) SCENES finetuning. The colour denotes whether the squared symmetrical epipolar distance is below (green) or above (red) a threshold of $5\cdot10^{-4}$. Our method improves the quality and number of correct matches.
  • Figure 4: Matches for the MatchFormer-lite [wang2022matchformer] model before (left) and after (right) SCENES finetuning on the EuRoC-MAV dataset.
  • Figure 5: Matches for the MatchFormer-lite [wang2022matchformer] model before (left) and after (right) SCENES finetuning on the San Francisco dataset.
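
The quantities named in the captions above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: coordinates are assumed to be normalised, the essential matrix E is assumed to satisfy x2^T E x1 = 0, and the function names, the candidate grid and the tolerance `tol` that decides whether a candidate counts as lying on the line are all hypothetical. The first function computes the squared symmetrical epipolar distance used to colour matches in Figures 1 and 3. The second builds a one-hot mask in the spirit of the classification loss in Figure 2 by selecting the highest-confidence candidate close to the epipolar line, and the third compares that mask to the confidences with a cross-entropy.

```python
import numpy as np


def squared_symmetric_epipolar_distance(E, x1, x2):
    """Squared symmetrical epipolar distance for (N, 2) normalised points."""
    x1h = np.hstack([x1, np.ones((len(x1), 1))])
    x2h = np.hstack([x2, np.ones((len(x2), 1))])
    lines2 = x1h @ E.T                               # E @ x1: lines in image 2
    lines1 = x2h @ E                                 # E.T @ x2: lines in image 1
    algebraic = np.sum(x2h * lines2, axis=1) ** 2    # (x2^T E x1)^2
    return algebraic * (1.0 / np.sum(lines2[:, :2] ** 2, axis=1)
                        + 1.0 / np.sum(lines1[:, :2] ** 2, axis=1))


def epipolar_mask(conf, line, grid_xy, tol):
    """One-hot mask at the most confident candidate within `tol` of the line.

    conf: (M,) predicted match probabilities for M candidates in image 2.
    line: (3,) epipolar line (a, b, c) with a*x + b*y + c = 0.
    grid_xy: (M, 2) candidate locations; tol: hypothetical distance tolerance.
    """
    conf = np.asarray(conf, dtype=float)
    dist = np.abs(grid_xy @ line[:2] + line[2]) / np.linalg.norm(line[:2])
    on_line = dist <= tol
    mask = np.zeros_like(conf)
    if on_line.any():
        # Ignore candidates away from the line when picking the target.
        mask[np.argmax(np.where(on_line, conf, -np.inf))] = 1.0
    return mask


def cross_entropy(conf, mask, eps=1e-9):
    """Cross-entropy between the one-hot mask and the predicted confidences."""
    return -np.sum(mask * np.log(np.asarray(conf, dtype=float) + eps))


# Usage with illustrative values: horizontal epipolar lines (sideways motion).
E = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
x1 = np.array([[0.1, 0.2]] * 3)
x2 = np.array([[0.5, 0.2], [0.5, 0.21], [0.5, 0.25]])
is_green = squared_symmetric_epipolar_distance(E, x1, x2) < 5e-4

line = E @ np.array([0.1, 0.2, 1.0])                 # epipolar line y = 0.2
grid_xy = np.array([[0.0, 0.0], [0.0, 0.2], [0.5, 0.2], [0.5, 0.5]])
conf = np.array([0.1, 0.2, 0.3, 0.4])
mask = epipolar_mask(conf, line, grid_xy, tol=0.05)
loss = cross_entropy(conf, mask)
```

In this toy example the first two matches fall below the $5\cdot10^{-4}$ threshold and the third does not. The mask selects the third candidate (confidence 0.3) rather than the globally most confident fourth candidate, because the fourth lies away from the epipolar line.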