Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence
Junyi Zhang, Charles Herrmann, Junhwa Hur, Eric Chen, Varun Jampani, Deqing Sun, Ming-Hsuan Yang
TL;DR
This work identifies geometry-aware semantic correspondence as a critical failure mode for features from large vision foundation models and shows that simple techniques that account for geometry can substantially improve matching. It formalizes geometry-aware correspondence, analyzes its prevalence and sensitivity to pose, and introduces test-time pose alignment, a dense training objective, pose-variant augmentation, and windowed soft argmax to enhance geometric understanding. A large-scale benchmark is built from the AP-10K animal pose estimation dataset to train and evaluate geometry-aware matching, and extensive experiments show substantial improvements over state-of-the-art methods on SPair-71k and AP-10K, including significant gains on geometry-aware subsets. The findings offer practical gains for downstream tasks and provide insights into the geometric understanding embedded in foundation-model features, while highlighting remaining challenges such as small object instances and extreme deformations.
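The summary names windowed soft argmax without describing it. The sketch below shows one plausible reading, not the authors' implementation: take the hard argmax of a similarity map, then compute a softmax-weighted average of coordinates restricted to a small window around that peak, which gives a sub-pixel estimate while ignoring distant distractors. The function name, the `window` and `temperature` parameters, and their defaults are assumptions made for illustration.

```python
import math

def windowed_soft_argmax(sim, window=3, temperature=1.0):
    """Sub-pixel peak estimate from a 2D similarity map (list of lists).

    Illustrative sketch only: `window` and `temperature` are hypothetical
    parameters, not values taken from the paper.
    """
    rows, cols = len(sim), len(sim[0])

    # Step 1: hard argmax over the whole map.
    best_r, best_c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: sim[rc[0]][rc[1]],
    )

    # Step 2: clip a window centered on the peak to the map bounds.
    half = window // 2
    r0, r1 = max(0, best_r - half), min(rows, best_r + half + 1)
    c0, c1 = max(0, best_c - half), min(cols, best_c + half + 1)

    # Step 3: softmax-weighted mean of coordinates inside the window.
    # Subtracting the peak value keeps the exponentials numerically stable.
    peak = sim[best_r][best_c]
    total = sum_r = sum_c = 0.0
    for r in range(r0, r1):
        for c in range(c0, c1):
            weight = math.exp((sim[r][c] - peak) / temperature)
            total += weight
            sum_r += weight * r
            sum_c += weight * c
    return sum_r / total, sum_c / total
```

Restricting the soft argmax to a window is the design choice that matters here: a plain soft argmax over the full map would be pulled toward any other high-similarity region, such as the mirrored part on the opposite side of an object.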
Abstract
While pre-trained large-scale vision models have shown significant promise for semantic correspondence, their features often struggle to grasp the geometry and orientation of instances. This paper identifies the importance of being geometry-aware for semantic correspondence and reveals a limitation of the features of current foundation models under simple post-processing. We show that incorporating this information can markedly enhance semantic correspondence performance with simple but effective solutions in both zero-shot and supervised settings. We also construct a new challenging benchmark for semantic correspondence built from an existing animal pose estimation dataset, for both pre-training and validating models. Our method achieves a PCK@0.10 score of 65.4 (zero-shot) and 85.6 (supervised) on the challenging SPair-71k dataset, outperforming the state of the art by 5.5p and 11.0p absolute gains, respectively. Our code and datasets are publicly available at: https://telling-left-from-right.github.io/.
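For readers unfamiliar with the metric, PCK@0.10 (Percentage of Correct Keypoints) counts a predicted keypoint as correct when it lies within 0.10 times a reference size of the ground-truth location. The sketch below assumes the reference size is the longer side of the object bounding box, a common convention; the function name and argument layout are illustrative and not taken from the released code.

```python
import math

def pck(pred, gt, bbox_size, alpha=0.10):
    """Fraction of predicted keypoints within alpha * max(bbox_h, bbox_w)
    of their ground-truth locations.

    Assumption: the threshold is normalized by the bounding box's longer
    side; other normalizations (e.g. image size) are also in use.
    """
    bbox_h, bbox_w = bbox_size
    threshold = alpha * max(bbox_h, bbox_w)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt)

# Hypothetical keypoints for one image pair, bounding box 100 x 80 pixels.
pred = [(10, 10), (50, 50), (90, 20)]
gt = [(12, 11), (70, 50), (90, 25)]
score = pck(pred, gt, bbox_size=(100, 80))  # threshold is 10 pixels
```

In this toy case two of the three predictions fall within the 10-pixel threshold, so the score is 2/3; reported figures such as 65.4 are this fraction expressed as a percentage over a test set.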
