Cross-View Completion Models are Zero-shot Correspondence Estimators
Honggyu An, Jinhyeon Kim, Seonghoon Park, Jaewoo Jung, Jisang Han, Sunghwan Hong, Seungryong Kim
TL;DR
The paper analyzes cross-view completion (CVC) models as zero-shot correspondence estimators, showing that their cross-attention maps encode geometric correspondences more precisely than correlations computed from encoder or decoder features. It introduces ZeroCo, a reciprocity-based zero-shot inference scheme that combines the cross-attention maps obtained from the original and the swapped input pair, and adds lightweight learning-based heads (ZeroCo-finetuned, ZeroCo-flow, ZeroCo-depth) for dense matching and multi-frame depth estimation. Across HPatches, ETH3D, KITTI, and Cityscapes, ZeroCo achieves state-of-the-art zero-shot performance and competitive or superior results when augmented with learnable heads, often surpassing epipolar-based cost volumes in depth tasks. The work thus positions the cross-attention map as a rich, readily usable cost volume for geometric tasks, supporting downstream use with minimal supervision and simple refinements.
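The reciprocity idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the averaging rule used to combine the two maps, the soft-argmax readout, and the temperature value are all assumptions chosen for illustration. The only part taken from the summary above is that the attention map from the original input order is combined with the one from the swapped order.

```python
import numpy as np

def combine_reciprocal(attn_ab, attn_ba):
    """Combine the A->B attention map with the one from the swapped inputs.

    attn_ab: (N_a, N_b), rows are queries in view A, columns are keys in view B.
    attn_ba: (N_b, N_a), attention obtained after swapping the two inputs.
    Averaging is an assumed combination rule, used here only for illustration.
    """
    return 0.5 * (attn_ab + attn_ba.T)

def soft_argmax_matches(cost, h_b, w_b, temperature=0.05):
    """Turn an (N_a, h_b * w_b) cost volume into expected (x, y) coordinates in view B."""
    logits = cost / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=1, keepdims=True)
    # Pixel grid of view B in row-major order: index j maps to (x, y) = (j % w_b, j // w_b).
    ys, xs = np.meshgrid(np.arange(h_b), np.arange(w_b), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    return prob @ grid  # (N_a, 2)

# Toy check on a 4x4 token grid with a known, perfectly consistent matching.
h, w = 4, 4
n = h * w
perm = np.random.default_rng(0).permutation(n)  # token i in A matches token perm[i] in B
attn_ab = np.eye(n)[perm]                       # one-hot attention A -> B
attn_ba = attn_ab.T                             # matching attention B -> A
cost = combine_reciprocal(attn_ab, attn_ba)
matches = soft_argmax_matches(cost, h, w)
```

In this toy case the two maps agree exactly, so the combined cost volume equals either input and the recovered coordinates coincide with the known matching. With real attention maps, combining the two directions is meant to down-weight matches that are supported in only one direction.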
Abstract
In this work, we explore new perspectives on cross-view completion learning by drawing an analogy to self-supervised correspondence learning. Through our analysis, we demonstrate that the cross-attention map within cross-view completion models captures correspondence more effectively than other correlations derived from encoder or decoder features. We verify the effectiveness of the cross-attention map by evaluating it on zero-shot matching as well as on learning-based geometric matching and multi-frame depth estimation. The project page is available at https://cvlab-kaist.github.io/ZeroCo/.
