Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence

Octave Mariotti, Zhipeng Du, Yash Bhalgat, Oisin Mac Aodha, Hakan Bilen

TL;DR

A novel approach for learning dense correspondences by lifting 2D keypoints into a canonical 3D space using monocular depth estimation, which constructs a continuous canonical manifold that captures object geometry without requiring explicit 3D supervision or camera annotations.

Abstract

Semantic correspondence (SC) aims to establish semantically meaningful matches across different instances of an object category. We illustrate how recent supervised SC methods remain limited in their ability to generalize beyond sparsely annotated training keypoints, effectively acting as keypoint detectors. To address this, we propose a novel approach for learning dense correspondences by lifting 2D keypoints into a canonical 3D space using monocular depth estimation. Our method constructs a continuous canonical manifold that captures object geometry without requiring explicit 3D supervision or camera annotations. Additionally, we introduce SPair-U, an extension of SPair-71k with novel keypoint annotations, to better assess generalization. Experiments demonstrate not only that our model significantly outperforms supervised baselines on unseen keypoints, highlighting its effectiveness in learning robust correspondences, but also that unsupervised baselines outperform supervised counterparts when generalized across different datasets.
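
As a rough illustration of the lifting step mentioned in the abstract, the sketch below backprojects 2D keypoints into 3D camera-frame points using a depth map and a pinhole camera model. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `backproject_keypoints`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy depth map are all hypothetical, and the abstract does not specify how the camera is modeled. The result lives in the camera (posed) frame; the canonical space described in the paper is learned and is not reproduced here.

```python
import numpy as np


def backproject_keypoints(keypoints_xy, depth, fx, fy, cx, cy):
    """Lift 2D pixel keypoints to 3D camera-frame points (pinhole model).

    keypoints_xy: (N, 2) array-like of (x, y) pixel coordinates.
    depth: (H, W) array of per-pixel depth, e.g. from a monocular estimator.
    fx, fy, cx, cy: assumed pinhole intrinsics (hypothetical values).
    """
    kps = np.asarray(keypoints_xy, dtype=np.float64)
    # Read depth at the nearest pixel, clamped to the image bounds.
    cols = np.clip(np.round(kps[:, 0]).astype(int), 0, depth.shape[1] - 1)
    rows = np.clip(np.round(kps[:, 1]).astype(int), 0, depth.shape[0] - 1)
    z = depth[rows, cols]
    # Invert the pinhole projection x = fx * X / Z + cx (and likewise for y).
    x = (kps[:, 0] - cx) * z / fx
    y = (kps[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


# Toy example: a constant depth of 2.0 and made-up intrinsics.
depth = np.full((4, 6), 2.0)
points = backproject_keypoints(
    [[3.0, 2.0], [5.0, 0.0]], depth, fx=2.0, fy=2.0, cx=3.0, cy=2.0
)
print(points)
```

A keypoint at the principal point maps onto the optical axis, so the first keypoint becomes (0, 0, 2); the second is offset by its pixel distance from the principal point scaled by depth over focal length.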

Paper Structure

This paper contains 19 sections, 6 equations, 10 figures, 7 tables, 1 algorithm.

Figures (10)

  • Figure 1: Illustration of the generalization gap on unseen keypoints. (Left) Top row: when evaluated on known keypoints, both our model and Geo-SC [zhang2023telling] perform well, while the unsupervised DINOv2+SD [taleof2feats] struggles to correctly disambiguate the legs of the horse. Bottom row: when presented with keypoints unseen at training time, both our model and DINOv2+SD predict noisy but reasonable correspondences, while Geo-SC predictions noticeably degrade. (Right) Even though it obtains strong performance on known keypoints, Geo-SC performs worse than its unsupervised counterpart on our new benchmark of unseen keypoints. In comparison, our model still achieves competitive results.
  • Figure 2: Overview of our approach. We extract segmentation masks and depth maps from training images and backproject object points to produce the posed point clouds $\mathcal{X}^b$. We predict dense features with $\Phi$ and match them against our jointly learned sparse category prototype $(\mathcal{P}, \mathcal{Z})$ to produce the canonical point clouds $\mathcal{X}^c$. The local geometric alignment between the two provides supervision for updating $\Phi$.
  • Figure 3: Example keypoint annotations from our new SPair-U evaluation dataset. It uses the same images as the SPair-71k dataset [min2019spair], but adds keypoints not present in SPair-71k. This enables benchmarking of SC methods on the existing keypoints along with our new ones. On the right we summarize the main statistics of our new dataset.
  • Figure 4: PCA visualization of the feature maps from different models. Note that PCA is computed on object features only. The inclusion of geometric constraints during training results in fewer high frequency artifacts in the predicted feature maps for our approach.
  • Figure 5: Visualization of keypoint matches for randomly selected object points. On each source image (left) we randomly sample points on the object of interest and compute their match on the target (right). Colored lines are used as a way to distinguish the points.
  • ...and 5 more figures
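
The point matching visualized in Figure 5 can be sketched as a nearest-neighbor search in feature space. This is a minimal sketch, assuming cosine similarity between dense feature maps as the matching criterion; the captions do not state the exact rule used, and the function name `match_points` and the toy feature maps below are hypothetical.

```python
import numpy as np


def match_points(src_feats, tgt_feats, src_points):
    """Match source pixels to target pixels by cosine similarity.

    src_feats, tgt_feats: (H, W, C) dense feature maps.
    src_points: iterable of (row, col) pixel locations in the source map.
    Returns a list of (row, col) locations in the target map.
    """
    _, w, c = tgt_feats.shape
    # Flatten the target map and L2-normalize every feature vector.
    tgt = tgt_feats.reshape(-1, c)
    tgt = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + 1e-8)
    matches = []
    for row, col in src_points:
        query = src_feats[row, col]
        query = query / (np.linalg.norm(query) + 1e-8)
        # The target pixel with the highest cosine similarity is the match.
        best = int(np.argmax(tgt @ query))
        matches.append((best // w, best % w))
    return matches


# Toy example: the target is the source flipped horizontally,
# so each source pixel should match its mirrored location.
src = np.array(
    [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 1.0], [-1.0, 1.0]]]
)
tgt = src[:, ::-1]
matches = match_points(src, tgt, [(0, 0), (1, 0)])
print(matches)
```

In practice the feature maps would come from a network such as $\Phi$ in Figure 2, and the search is usually restricted to the object mask; both details are omitted here for brevity.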