Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence

Junyi Zhang, Charles Herrmann, Junhwa Hur, Eric Chen, Varun Jampani, Deqing Sun, Ming-Hsuan Yang

TL;DR

This work identifies geometry-aware semantic correspondence as a critical failure mode for features from large vision foundation models and demonstrates that simple post-processing can substantially improve matching when geometry is considered. It formalizes geometry-aware correspondence, analyzes its prevalence and sensitivity to pose, and introduces test-time pose alignment, a dense training objective, pose-variant augmentation, and windowed soft argmax to enhance geometric understanding. A large-scale AP-10K benchmark is built to train and evaluate geometry-aware matching, and extensive experiments show substantial improvements over state-of-the-art methods on SPair-71k and AP-10K, including significant gains in geometry-aware subsets. The findings offer practical gains for downstream tasks and provide insights into the geometric understanding embedded in foundation-model features, while highlighting remaining challenges such as small object instances and extreme deformations.
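The summary above names a windowed soft argmax but does not spell it out. As a rough illustration only, and not the authors' implementation, the idea can be sketched as: find the hard argmax of a similarity map, then take a softmax-weighted average of coordinates restricted to a small window around that peak. The function name `windowed_soft_argmax`, the `window` size, and the `temperature` value are all assumptions made for this sketch.

```python
from math import exp


def windowed_soft_argmax(sim, window=3, temperature=0.05):
    """Sub-pixel peak location on a 2D similarity map (illustrative sketch).

    sim: list of rows (H x W) of similarity scores.
    window: odd side length of the square window around the hard argmax (assumed).
    temperature: softmax temperature inside the window (assumed).
    Returns (x, y) as floats.
    """
    h, w = len(sim), len(sim[0])

    # Step 1: hard argmax over the whole map (first maximum in row-major order).
    best_y, best_x = max(
        ((y, x) for y in range(h) for x in range(w)),
        key=lambda p: sim[p[0]][p[1]],
    )
    peak = sim[best_y][best_x]

    # Step 2: softmax-weighted coordinate average restricted to the window,
    # so distant high-similarity distractors cannot pull the estimate away.
    r = window // 2
    total = sum_x = sum_y = 0.0
    for y in range(max(0, best_y - r), min(h, best_y + r + 1)):
        for x in range(max(0, best_x - r), min(w, best_x + r + 1)):
            weight = exp((sim[y][x] - peak) / temperature)  # peak-shifted for stability
            total += weight
            sum_x += weight * x
            sum_y += weight * y
    return sum_x / total, sum_y / total


# Toy map: two equal neighbouring peaks plus a far-away distractor.
toy = [[0.0] * 5 for _ in range(5)]
toy[2][2] = 1.0
toy[2][3] = 1.0
toy[0][4] = 0.9  # outside the 3x3 window around (x=2, y=2), so it is ignored
x, y = windowed_soft_argmax(toy)
```

In this toy case the estimate lands roughly halfway between the two neighbouring peaks, whereas a soft argmax over the whole map would also be drawn toward the distractor in the corner.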

Abstract

While pre-trained large-scale vision models have shown significant promise for semantic correspondence, their features often struggle to grasp the geometry and orientation of instances. This paper identifies the importance of being geometry-aware for semantic correspondence and reveals a limitation of the features of current foundation models under simple post-processing. We show that incorporating this information can markedly enhance semantic correspondence performance with simple but effective solutions in both zero-shot and supervised settings. We also construct a new challenging benchmark for semantic correspondence built from an existing animal pose estimation dataset, for both pre-training and validating models. Our method achieves a PCK@0.10 score of 65.4 (zero-shot) and 85.6 (supervised) on the challenging SPair-71k dataset, outperforming the state of the art by 5.5p and 11.0p absolute gains, respectively. Our code and datasets are publicly available at: https://telling-left-from-right.github.io/.
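For readers unfamiliar with the PCK@0.10 metric quoted above, the following is a minimal sketch of how such a score is commonly computed. It assumes the widespread convention that a predicted keypoint counts as correct when it lies within `alpha * max(bbox_w, bbox_h)` of the ground truth, with the bounding box taken from the target instance; the abstract does not state the exact protocol, and the function name `pck` and its arguments are hypothetical.

```python
from math import hypot


def pck(pred, gt, bbox_w, bbox_h, alpha=0.10):
    """Percentage of Correct Keypoints for one image pair (illustrative sketch).

    pred, gt: equal-length lists of (x, y) keypoint locations in the target image.
    bbox_w, bbox_h: size of the target instance's bounding box (assumed convention).
    alpha: tolerance as a fraction of the larger bounding-box side.
    Returns the fraction of keypoints within the tolerance, in [0, 1].
    """
    if not gt or len(pred) != len(gt):
        raise ValueError("pred and gt must be non-empty and of equal length")
    threshold = alpha * max(bbox_w, bbox_h)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt)


# Toy example: threshold is 0.10 * max(100, 80) = 10 pixels.
pred = [(10, 10), (50, 50), (90, 90)]
gt = [(12, 10), (50, 65), (90, 91)]
score = pck(pred, gt, bbox_w=100, bbox_h=80)  # errors are 2, 15 and 1 pixels
```

Here two of the three predictions fall within the 10-pixel tolerance. Reported benchmark numbers such as 65.4 are this fraction expressed as a percentage and aggregated over the test set.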

Paper Structure (29 sections, 3 equations, 23 figures, 13 tables)

Figures (23)

  • Figure 1: Illustration of geometry-aware correspondence.
  • Figure 2: Generated samples from SD-2-1 with the prompt (left) "A cat holding up its left front paw" and (right) "A car with the right front door open". SD has difficulty generating images that require understanding the intrinsic geometry of instances.
  • Figure 3: Annotations of geometry-aware semantic correspondence (yellow) and standard semantic correspondence (blue).
  • Figure 4: Per-category evaluation of state-of-the-art methods on SPair-71k geometry-aware subset (Geo.) and standard set. While the geometry-aware subset accounts for 60% of the total matching keypoints, we observe a substantial performance gap between the two sets for all the methods.
  • Figure 5: Evaluation of the sensitivity to pose variations. The y-axis shows the normalized difference between the best and the worst performance among 5 different azimuth-variation subsets. We report the results of the unsupervised and supervised methods on both the geometry-aware (Geo.) and standard set. The larger the value, the more sensitive the performance is to pose variation.
  • ...and 18 more figures