Learning Feature Descriptors using Camera Pose Supervision

Qianqian Wang, Xiaowei Zhou, Bharath Hariharan, Noah Snavely

TL;DR

CAPS tackles the problem of learning local feature descriptors without dense pixel-level correspondences by using relative camera poses as supervision. It introduces an epipolar-based loss and a cycle-consistency constraint, optimized via a differentiable matching layer that represents correspondence as an expectation over a learned distribution, all within a coarse-to-fine architecture for efficiency. Empirically, CAPS achieves state-of-the-art performance on HPatches and downstream 3D tasks, even when trained with only pose supervision, and can further improve with ground-truth matches. This approach enables scalable descriptor learning on large, diverse datasets, improving generalization for real-world two-view geometry and SfM pipelines.

Abstract

Recent research on learned visual descriptors has shown promising improvements in correspondence estimation, a key component of many 3D vision tasks. However, existing descriptor learning frameworks typically require ground-truth correspondences between feature points for training, which are challenging to acquire at scale. In this paper we propose a novel weakly-supervised framework that can learn feature descriptors solely from relative camera poses between images. To do so, we devise both a new loss function that exploits the epipolar constraint given by camera poses, and a new model architecture that makes the whole pipeline differentiable and efficient. Because we no longer need pixel-level ground-truth correspondences, our framework opens up the possibility of training on much larger and more diverse datasets for better and unbiased descriptors. We call the resulting descriptors CAmera Pose Supervised, or CAPS, descriptors. Though trained with weak supervision, CAPS descriptors outperform even prior fully-supervised descriptors and achieve state-of-the-art performance on a variety of geometric tasks. Project Page: https://qianqianwang68.github.io/CAPS/
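
As an illustration of how a relative camera pose can act as a supervision signal, the following is a minimal NumPy sketch of a point-to-epipolar-line distance. It is not the authors' implementation. The function names (`skew`, `fundamental_from_pose`, `epipolar_distance`), the pinhole intrinsics `K1` and `K2`, and the pose convention $X_2 = R X_1 + t$ are assumptions made for this example.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_pose(K1, K2, R, t):
    """Fundamental matrix F = K2^{-T} [t]_x R K1^{-1}.

    Assumes (hypothetical convention) that a 3D point X1 in camera-1
    coordinates maps to camera-2 coordinates as X2 = R @ X1 + t.
    """
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def epipolar_distance(x1, x2_hat, F):
    """Pixel distance from each predicted point x2_hat to the line F @ x1.

    x1, x2_hat: arrays of shape (N, 2) holding pixel coordinates.
    Returns an array of shape (N,).
    """
    ones = np.ones((x1.shape[0], 1))
    x1_h = np.hstack([x1, ones])        # homogeneous query points
    x2_h = np.hstack([x2_hat, ones])    # homogeneous predictions
    lines = x1_h @ F.T                  # row i is the epipolar line F @ x1[i]
    num = np.abs(np.sum(lines * x2_h, axis=1))
    den = np.linalg.norm(lines[:, :2], axis=1)
    return num / den
```

Averaging this distance over a batch of predicted correspondences gives one possible scalar loss. Any predicted point on the epipolar line has zero distance, so the constraint alone does not identify a unique match.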

Paper Structure

This paper contains 13 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of our method. Our model can learn descriptors using only relative camera poses (e.g., from SfM reconstructions (a)). Knowing the camera poses, we obtain the epipolar constraints illustrated in (b), where points in the first image correspond to the epipolar lines of the same color in the second image. We use these epipolar constraints as our supervision signal (see Fig. 2). (c) shows that at inference, our descriptors establish reliable correspondences even for challenging image pairs.
  • Figure 2: Epipolar loss and cycle consistency loss. $\mathbf{x}_1$ (yellow) is the query point, and $\mathbf{\hat{x}}_2$ (orange) is the predicted correspondence. The epipolar loss $\mathcal{L}_{\textit{ep}}$ is the distance between $\mathbf{\hat{x}}_2$ and the ground-truth epipolar line $\mathbf{Fx}_1$. The cycle consistency loss $\mathcal{L}_{\textit{cy}}$ is the $L_2$ distance between $\mathbf{x}_1$ and its forward-backward corresponding point (green).
  • Figure 3: Network architecture design. (a) The differentiable matching layer. For a query point, its correspondence location is represented as the expectation of a distribution computed from the correlation between feature descriptors. (b) The coarse-to-fine module. We use the location of highest probability at the coarse level (red circle) to determine the location of a local window $W$ at the fine level. During training, we compute the correspondence locations at both the coarse and fine levels from the distributions $p^c$ and $p^f$, respectively, and impose our loss functions on both. This allows us to train coarse- and fine-level features simultaneously.
  • Figure 4: Mean matching accuracy (MMA) on HPatches [hpatches_2017_cvpr]. For each method, we show the MMA at varying pixel error thresholds. We also report the mean number of detected features and mutual nearest neighbor matches. With SuperPoint [detone2018superpoint] keypoints, our approach achieves the best overall performance at error thresholds above $2$px.
  • Figure 5: Dense feature matching on HPatches. (a) PCK comparison. CAPS outperforms other methods at larger pixel thresholds ($>$ 4px). (b) Qualitative result of dense feature matching. Color indicates correspondence.
  • ...and 2 more figures
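
The differentiable matching layer described in the Figure 3 caption (a correspondence represented as the expectation of a distribution computed from descriptor correlations) can be sketched roughly as follows. This is a hedged NumPy illustration, not the authors' code. The softmax normalization, the `temperature` parameter, and the name `expected_correspondence` are assumptions introduced for this example.

```python
import numpy as np

def expected_correspondence(query_desc, feat2, temperature=1.0):
    """Soft correspondence of one query descriptor in a second feature map.

    query_desc: (C,) descriptor sampled at the query point in image 1.
    feat2:      (H, W, C) dense descriptor map of image 2.
    Returns the expected (x, y) location and the (H, W) probability map.
    """
    H, W, C = feat2.shape
    # Correlation between the query and every descriptor in image 2.
    corr = feat2.reshape(-1, C) @ query_desc
    # Softmax over all locations (assumed normalization), shifted for stability.
    logits = corr / temperature
    logits = logits - logits.max()
    p = np.exp(logits)
    p = p / p.sum()
    # Expectation of the pixel grid under p: a weighted average of coordinates,
    # which stays differentiable with respect to the descriptors.
    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    x_hat = p @ coords
    return x_hat, p.reshape(H, W)
```

Because the output is a weighted average rather than a hard argmax, a loss on the predicted location can pass gradients back to the descriptors. In a coarse-to-fine setup like the one in the caption, the same computation could be applied once over the full coarse map and once inside a local window $W$ of the fine map.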