Cross-View Completion Models are Zero-shot Correspondence Estimators

Honggyu An, Jinhyeon Kim, Seonghoon Park, Jaewoo Jung, Jisang Han, Sunghwan Hong, Seungryong Kim

TL;DR

The paper analyzes cross-view completion (CVC) models as zero-shot correspondence estimators and shows that their cross-attention maps capture geometric correspondence more precisely than correlations computed from encoder or decoder features. It formalizes ZeroCo, a reciprocity-based zero-shot inference scheme that combines the cross-attention maps obtained from the original and the swapped input order, and introduces lightweight learned heads (ZeroCo-finetuned, ZeroCo-flow, ZeroCo-depth) for dense matching and multi-frame depth estimation. Across HPatches, ETH3D, KITTI, and Cityscapes, ZeroCo achieves state-of-the-art zero-shot performance and competitive or superior results when augmented with the learned heads, often surpassing epipolar-based cost volumes on depth tasks. The work thus positions the cross-attention map as a readily usable cost volume for geometric tasks, one that needs little supervision and only simple refinement.

Abstract

In this work, we explore new perspectives on cross-view completion learning by drawing an analogy to self-supervised correspondence learning. Through our analysis, we demonstrate that the cross-attention map within cross-view completion models captures correspondence more effectively than other correlations derived from encoder or decoder features. We verify the effectiveness of the cross-attention map by evaluating it on zero-shot matching, learning-based geometric matching, and multi-frame depth estimation. The project page is available at https://cvlab-kaist.github.io/ZeroCo/.
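
The reciprocity idea summarized in the TL;DR can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the function names (`cross_attention`, `reciprocal_cost`, `to_grid_xy`, `demo`) are hypothetical, random feature vectors stand in for the tokens a CVC model would produce, and fusing the two directions by a plain average is an assumption, since the exact fusion rule is not stated here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys):
    """Scaled dot-product attention weights of shape (N_q, N_k); rows sum to 1."""
    scale = np.sqrt(queries.shape[-1])
    return softmax(queries @ keys.T / scale, axis=-1)


def reciprocal_cost(attn_t2s, attn_s2t):
    """Fuse both directions into one (N_t, N_s) cost volume.

    Assumption: a plain average of the target->source map and the transposed
    source->target map; the paper's exact fusion rule may differ.
    """
    return 0.5 * (attn_t2s + attn_s2t.T)


def to_grid_xy(flat_idx, grid_w):
    """Convert flat token indices to (x, y) patch coordinates."""
    return np.stack([flat_idx % grid_w, flat_idx // grid_w], axis=-1)


def demo(n_tokens=16, dim=256, seed=0):
    """Toy check: target tokens are a shuffled copy of the source tokens."""
    rng = np.random.default_rng(seed)
    source = rng.standard_normal((n_tokens, dim))
    perm = rng.permutation(n_tokens)
    target = source[perm]  # target token i corresponds to source token perm[i]

    attn_t2s = cross_attention(target, source)  # original input order
    attn_s2t = cross_attention(source, target)  # swapped input order
    cost = reciprocal_cost(attn_t2s, attn_s2t)
    matches = cost.argmax(axis=1)  # best source token for each target token
    return matches, perm


if __name__ == "__main__":
    matches, perm = demo()
    print("recovered the shuffle:", np.array_equal(matches, perm))
    print("patch coordinates on a 4x4 grid:", to_grid_xy(matches, grid_w=4)[:3])
```

Taking the argmax over each row of the fused cost volume mirrors the query-point lookup shown in Figure 1, where the source location with the highest attention is reported as the match.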

Paper Structure

This paper contains 60 sections, 21 equations, 14 figures, 14 tables.

Figures (14)

  • Figure 1: Cross-view completion models [weinzaepfel2022croco, weinzaepfel2023croco] are zero-shot correspondence estimators. Given a pair of images consisting of a target (left) and a source (right) image, we visualize the attended region in the source image corresponding to a query point marked in blue in the target image; the point with the highest attention is marked in red. Although cross-view completion models [weinzaepfel2022croco, weinzaepfel2023croco] are not trained with correspondence supervision, their cross-attention maps already establish precise correspondences.
  • Figure 2: Analogy of cross-view completion and self-supervised matching learning. The cost volume learned by (b) the cross-attention layers within cross-view completion models [weinzaepfel2022croco, weinzaepfel2023croco] closely resembles that of (a) traditional self-supervised matching methods [liu2019selflow, jonschkowski2020mattersunsupervisedopticalflow].
  • Figure 3: Visualization of matching costs. We visualize the matching costs of the (d) encoder, (e) decoder, and (f) cross-attention maps in the (a) cross-view completion model [weinzaepfel2022croco, weinzaepfel2023croco]. The cross-attention exhibits the sharpest attention, while the encoder and decoder correlations exhibit broader attention, indicating that geometric cues are most effectively captured in the cross-attention maps.
  • Figure 4: Visualization of the attention map with and without the register token. The initial cross-attention map of CroCo [weinzaepfel2023croco] often contains artifacts due to the register tokens as in (c). After correcting this, the proper attending point is identified as in (d).
  • Figure 5: Visualization of matching costs in previous zero-shot matching methods [tang2023emergent, zhang2024tale], encoder and decoder features within cross-view completion models, and our ZeroCo.
  • ...and 9 more figures