UFM: A Simple Path towards Unified Dense Correspondence with Flow

Yuchen Zhang, Nikhil Keetha, Chenwei Lyu, Bhuvan Jhamb, Yutian Chen, Yuheng Qiu, Jay Karhade, Shreyas Jha, Yaoyu Hu, Deva Ramanan, Sebastian Scherer, Wenshan Wang

TL;DR

Dense image correspondence spans optical flow and wide-baseline matching, which have traditionally been tackled separately. The authors introduce UFM, a simple transformer-based model that directly regresses dense flow and covisibility from a unified training set of co-visible pixels, achieving state-of-the-art or near state-of-the-art performance across both domains with substantial efficiency gains. Key contributions are training on 12 diverse datasets and a new TA-WB benchmark for evaluating challenging wide-baseline cases, which together demonstrate strong zero-shot generalization and compatibility with refinement techniques. The work highlights the benefits of unified training for cross-domain robustness and sets the stage for future integration with semantic cues and fast refinement for real-time, multi-modal correspondence tasks.

Abstract

Dense image correspondence is central to many applications, such as visual odometry, 3D reconstruction, object association, and re-identification. Historically, dense correspondence has been tackled separately for wide-baseline scenarios and optical flow estimation, despite the common goal of matching content between two images. In this paper, we develop a Unified Flow & Matching model (UFM), which is trained on unified data for pixels that are co-visible in both source and target images. UFM uses a simple, generic transformer architecture that directly regresses the (u,v) flow. It is easier to train and more accurate for large flows than the typical coarse-to-fine cost volumes used in prior work. UFM is 28% more accurate than state-of-the-art flow methods (Unimatch), while also having 62% less error and being 6.7x faster than dense wide-baseline matchers (RoMa). UFM is the first to demonstrate that unified training can outperform specialized approaches across both domains. This result enables fast, general-purpose correspondence and opens new directions for multi-modal, long-range, and real-time correspondence tasks.
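
To make the notion of a regressed (u,v) flow over co-visible pixels concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes a flow field stored as an (H, W, 2) array and a boolean covisibility mask. The function names `flow_to_correspondence` and `masked_epe` are hypothetical, and the paper's actual loss and evaluation protocol may differ.

```python
import numpy as np

def flow_to_correspondence(flow):
    """Turn a dense (H, W, 2) flow field into target-image coordinates."""
    h, w = flow.shape[:2]
    # Source pixel grid: xs varies along the width, ys along the height.
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    grid = np.stack([xs, ys], axis=-1).astype(flow.dtype)
    # Each source pixel (x, y) maps to (x + u, y + v) in the target image.
    return grid + flow

def masked_epe(pred_flow, gt_flow, covisible):
    """Mean end-point error (EPE), averaged over covisible pixels only."""
    err = np.linalg.norm(pred_flow - gt_flow, axis=-1)  # per-pixel error, (H, W)
    if covisible.sum() == 0:
        return float("nan")  # no covisible pixel, so the error is undefined
    return float(err[covisible].mean())
```

Restricting the average to covisible pixels mirrors the abstract's description of training on pixels visible in both images: a pixel with no counterpart in the target image has no well-defined flow to compare against.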

Paper Structure

This paper contains 47 sections, 11 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: UFM (Unified Flow & Matching) unifies dense pixel correspondence tasks such as optical flow and wide-baseline matching. We visualize sets of $2 \times 2$ grids, where the top 2 images are the input, and the bottom 2 are images warped with forward & backward flow. UFM is able to match across a wide range of baselines, including extreme ones with little co-visible overlap.
  • Figure 2: The UFM Architecture: Two images are encoded by a shared DINOv2 encoder into patch features, concatenated, and then processed by 12 self-attention transformer layers. Intermediate tokens are decoded by separate DPT heads to regress pixel displacement and covisibility maps, representing correspondence and visibility across views.
  • Figure 3: Refinement of Correspondence by Classification: We compute a per-pixel feature map by combining (1) globally aligned features from the UFM backbone and (2) local fine features encoded by a separate U-Net. For each pixel in the source image, we first use the regressed flow target to interpolate features in a local neighborhood around it. We then compute the attention between the source feature and the features from that neighborhood, and use it to form a weighted sum of the neighborhood coordinates as the refinement value. $b$ is a constant attention bias.
  • Figure 4: UFM on Ego-Exo 4D [grauman2024ego]: UFM succeeds in matching out-of-distribution environments, camera models, and challenging viewpoint shifts, showcasing its strong generalization.
  • Figure 5: Architecture Ablation: Validation EPE for various architectures trained on the same $224\times224$ resolution data as UFM. We report performance on different validation sets at Data Bound ($22.5$M pairs) or Compute Bound (32 hours on 8 H100 GPUs). (a) Validation Set Performance: When trained on more difficult data (such as TartanAir), UFM significantly outperforms alternatives for both bounded data and compute. (b) Training Speed Comparison: We plot the number of pairs seen during training as a function of compute, and label the number of pairs that each architecture can train on at compute bound. UFM is far more efficient than most methods (except SEA-RAFT).
  • ...and 9 more figures
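
The refinement step described in the Figure 3 caption can be sketched for a single pixel as follows. This is an illustrative NumPy sketch under several assumptions, not the paper's implementation: the combined per-pixel features are taken as given (H, W, C) arrays, attention is assumed to be scaled dot-product followed by a softmax, the window radius is arbitrary, and the constant bias $b$ is omitted because the caption does not specify where it enters. The names `bilinear_sample` and `refine_pixel` are hypothetical.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample a (H, W, C) feature map at a float location (x, y)."""
    h, w = feat.shape[:2]
    x = float(np.clip(x, 0, w - 1))
    y = float(np.clip(y, 0, h - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0, x0] + ax * feat[y0, x1]
    bot = (1 - ax) * feat[y1, x0] + ax * feat[y1, x1]
    return (1 - ay) * top + ay * bot

def refine_pixel(src_feat, tgt_feat, px, py, flow_uv, radius=1):
    """Refine one source pixel's flow by attention over a local target window."""
    q = src_feat[py, px]                       # source (query) feature, (C,)
    cx, cy = px + flow_uv[0], py + flow_uv[1]  # regressed target location
    offsets = [(dx, dy)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)]
    # Interpolate target features on a window centred at the regressed location.
    keys = np.stack([bilinear_sample(tgt_feat, cx + dx, cy + dy)
                     for dx, dy in offsets])   # (K, C)
    logits = keys @ q / np.sqrt(q.shape[0])    # assumed scaled dot-product scores
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                   # softmax over the window
    # Attention-weighted sum of window offsets gives the refinement value.
    delta = weights @ np.asarray(offsets, dtype=float)
    return np.asarray(flow_uv, dtype=float) + delta
```

When one window position matches the source feature much better than the others, the softmax concentrates on it and the refined flow snaps to that position. When all positions score equally, the weighted offset is zero and the regressed flow is returned unchanged.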