
MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction

Jongmin Lee, Seungyeop Kang, Sungjoo Yoo

Abstract

Establishing consistent correspondences across images is essential for 3D vision tasks such as structure-from-motion (SfM), yet most existing matchers operate in a pairwise manner, often producing fragmented and geometrically inconsistent tracks when their predictions are chained across views. We propose MV-RoMa, a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible targets. Specifically, we design an efficient model architecture that avoids the high computational cost of full cross-attention for multi-view feature interaction: (i) a multi-view encoder that leverages pairwise matching results as a geometric prior, and (ii) a multi-view matching refiner that refines correspondences using pixel-wise attention. Additionally, we propose a post-processing strategy that integrates our model's consistent multi-view correspondences as high-quality tracks for SfM. Across diverse and challenging benchmarks, MV-RoMa produces more reliable correspondences and substantially denser, more accurate 3D reconstructions than existing sparse and dense matching methods. Project page: https://icetea-cv.github.io/mv-roma/.

Figures (7)

  • Figure 1: Overview of MV-RoMa. Given a source and multiple co-visible target images, MV-RoMa jointly estimates dense correspondence fields that are geometrically consistent across views (top). These fields are then fed into the SfM pipeline, finally yielding a dense and accurate reconstructed point cloud (bottom).
  • Figure 2: Pipeline of MV-RoMa. (a) Building Tracks. Given a source view $I_0$ and a set of target views $\{I_v\}$, we first obtain initial pairwise matches from an off-the-shelf matcher and apply a sampling procedure to construct a sparse set of multi-view prior tracks. (b) Multi-view Track Prediction. The RGB images and prior tracks are fed into our Multi-view Encoder (Sec. \ref{sec:method-encoder}), which uses track-based cross-attention to produce geometrically consistent dense features. The Multi-view Refiner (Sec. \ref{sec:method_decoder}) then applies pixel-wise cross-attention to predict the final dense correspondences $W^{0\rightarrow v}$ for all target views.
  • Figure 3: Pixel-aligned multi-view attention. Target features are warped to the source grid using $W^{v\rightarrow 0}$, and per-pixel attention is performed across the aligned views. This avoids the quadratic cost of global cross-attention while refining fine-grained correspondence estimates (Sec. \ref{sec:method_decoder}).
  • Figure 4: Confidence selection and reciprocity filtering. (a) For each pixel $u$, we select the correspondence with the highest confidence $p_{*}^{a\rightarrow b}(u)$ from multiple predictions across groups. (b) We then apply bidirectional (forward–backward) consistency filtering, retaining matches whose cycle error is below the threshold $\epsilon_p$ (Sec. \ref{sec:method-postprocess}).
  • Figure 5: Group Sampling Procedure. We use a two-stage strategy: Stage 1 greedily selects targets based on selection scores, while Stage 2 generates additional groups to enforce reciprocity.
  • ...and 2 more figures
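The reciprocity filtering step in Figure 4(b) can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the paper's implementation: the function name `reciprocity_filter`, the nearest-neighbor rounding when composing the two flows, and the default threshold value are all hypothetical, and the real pipeline operates on the model's dense correspondence fields with confidence selection applied first.

```python
import numpy as np

def reciprocity_filter(W_ab, W_ba, eps_p=2.0):
    """Bidirectional (forward-backward) consistency filtering (sketch).

    W_ab: (H, W, 2) field mapping each pixel of image a to (x, y) in image b.
    W_ba: (H, W, 2) field mapping each pixel of image b back to (x, y) in a.
    Returns a boolean mask over image a: True where the cycle error
    ||W_ba(round(W_ab(u))) - u|| is below eps_p.
    """
    H, W, _ = W_ab.shape
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs, ys], axis=-1).astype(np.float32)  # (H, W, 2) as (x, y)

    # Follow the forward field into image b; round to the nearest pixel
    # (a real implementation would likely interpolate W_ba instead).
    bx = np.clip(np.round(W_ab[..., 0]).astype(int), 0, W - 1)
    by = np.clip(np.round(W_ab[..., 1]).astype(int), 0, H - 1)

    # Follow the backward field home and measure the round-trip error.
    back = W_ba[by, bx]                        # (H, W, 2) coordinates in a
    cycle_err = np.linalg.norm(back - grid, axis=-1)
    return cycle_err < eps_p
```

With identity flows in both directions every pixel survives, while a flow shifted well beyond `eps_p` is rejected, which matches the intent of the cycle-error test in the caption.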