UniCorrn: Unified Correspondence Transformer Across 2D and 3D

Prajnan Goswami, Tianye Ding, Feng Liu, Huaizu Jiang

Abstract

Visual correspondence across image-to-image (2D-2D), image-to-point cloud (2D-3D), and point cloud-to-point cloud (3D-3D) geometric matching forms the foundation for numerous 3D vision tasks. Despite sharing a similar problem structure, current methods use task-specific designs with separate models for each modality combination. We present UniCorrn, the first correspondence model with shared weights that unifies geometric matching across all three tasks. Our key insight is that Transformer attention naturally captures cross-modal feature similarity. We propose a dual-stream decoder that maintains separate appearance and positional feature streams. This design enables end-to-end learning through stackable layers while supporting flexible query-based correspondence estimation across heterogeneous modalities. Our architecture employs modality-specific backbones followed by shared encoder and decoder components, trained jointly on diverse data combining pseudo point clouds from depth maps with real 3D correspondence annotations. UniCorrn achieves competitive performance on 2D-2D matching and surpasses prior state-of-the-art by 8% on 7Scenes (2D-3D) and 10% on 3DLoMatch (3D-3D) in registration recall. Project website: https://neu-vi.github.io/UniCorrn

Paper Structure

This paper contains 23 sections, 16 equations, 14 figures, and 15 tables.

Figures (14)

  • Figure 1: UniCorrn is a unified correspondence transformer that can find correspondences for keypoints of interest across 2D and 3D.
  • Figure 2: Illustration of the overall architecture design. Our model consists of four main modules: (1) modality-specific backbone, (2) feature fusion encoder, (3) matching decoder, and (4) modality-specific prediction heads. Details of each module can be found in the network architecture section; a composition sketch follows this figure list.
  • Figure 3: Dual-stream attention with a single attention matrix (matching cost). The appearance and position features are concatenated along the channel dimension to process them in parallel. After applying attention, the output is split to update the corresponding appearance ($\mathbf{F}_k$) and positional ($\mathbf{P}_k$) residual streams; a PyTorch sketch of this layer appears after this list.
  • Figure 4: Top: AUC vs. number of matching decoder layers. Bottom: AUC vs. feature upsampling ratio. The results are obtained on the MegaDepth-1500 dataset.
  • Figure 5: Visual results of 2D-2D matching on MegaDepth. Green/red lines indicate inlier/outlier correspondences. Zoom in for details.
  • ...and 9 more figures
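
As a reading aid for Figure 2, the composition below is a minimal sketch of the four-module pipeline, assuming only the data flow stated in the caption: modality-specific backbones feed a shared fusion encoder and matching decoder, followed by modality-specific prediction heads. Every class and argument name here is a hypothetical stand-in, not the authors' released implementation.

```python
import torch.nn as nn

class UniCorrnSketch(nn.Module):
    """Hypothetical composition sketch of Figure 2; only the module
    wiring is taken from the caption, all names are stand-ins."""

    def __init__(self, image_backbone, point_backbone,
                 fusion_encoder, matching_decoder, heads):
        super().__init__()
        # (1) modality-specific backbones, one per input modality
        self.backbones = nn.ModuleDict({
            "image": image_backbone,   # 2D branch
            "points": point_backbone,  # 3D branch
        })
        # (2) shared feature fusion encoder and (3) shared matching decoder
        self.fusion_encoder = fusion_encoder
        self.matching_decoder = matching_decoder
        # (4) modality-specific prediction heads, keyed by target modality
        self.heads = nn.ModuleDict(heads)

    def forward(self, src, tgt, src_mod, tgt_mod, queries):
        # Shared weights: the same encoder/decoder serve 2D-2D, 2D-3D, 3D-3D.
        feat_src = self.backbones[src_mod](src)
        feat_tgt = self.backbones[tgt_mod](tgt)
        fused_src, fused_tgt = self.fusion_encoder(feat_src, feat_tgt)
        matched = self.matching_decoder(queries, fused_src, fused_tgt)
        return self.heads[tgt_mod](matched)
```

The same forward pass covers all three task pairings by swapping `src_mod` and `tgt_mod`, which is what "shared weights" means in the abstract: only the backbones and output heads change with modality.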
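
The dual-stream attention of Figure 3 is concrete enough for a short PyTorch sketch. This is an illustrative assumption, not the paper's code: the head count, use of `nn.MultiheadAttention`, and residual form are guesses, but the mechanism follows the caption: appearance and positional features are concatenated channel-wise, one attention matrix is computed and applied, and the output is split to update the two residual streams.

```python
import torch
import torch.nn as nn

class DualStreamAttention(nn.Module):
    """Sketch of the dual-stream attention in Figure 3 (assumed details:
    head count, residual form, and tensor shapes are illustrative)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # A single attention over the concatenated channels, so appearance
        # and position share one attention matrix (one matching cost).
        self.attn = nn.MultiheadAttention(2 * dim, num_heads, batch_first=True)

    def forward(self, f_q, p_q, f_kv, p_kv):
        # Concatenate appearance (F_k) and positional (P_k) features
        # along the channel dimension to process them in parallel.
        q = torch.cat([f_q, p_q], dim=-1)      # (B, Nq, 2*dim)
        kv = torch.cat([f_kv, p_kv], dim=-1)   # (B, Nk, 2*dim)
        out, _ = self.attn(q, kv, kv)          # one attention matrix
        # Split the output and update each residual stream separately.
        delta_f, delta_p = out.chunk(2, dim=-1)
        return f_q + delta_f, p_q + delta_p

# Usage: cross-attention from one view's tokens to the other's.
f1, p1 = torch.randn(2, 96, 256), torch.randn(2, 96, 256)
f2, p2 = torch.randn(2, 128, 256), torch.randn(2, 128, 256)
f1_new, p1_new = DualStreamAttention(dim=256)(f1, p1, f2, p2)
```

Because the two streams share attention weights, positional features are aggregated with the same matching cost that aligns appearance, which is presumably what lets the stacked decoder layers refine both streams in lockstep.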