RoMa v2: Harder Better Faster Denser Feature Matching

Johan Edstedt, David Nordström, Yushan Zhang, Georg Bökman, Jonathan Astermark, Viktor Larsson, Anders Heyden, Fredrik Kahl, Mårten Wadenbäck, Michael Felsberg

TL;DR

RoMa v2 is a dense feature matcher built as a two-stage pipeline that couples a fast coarse matcher with lightweight refiners. It introduces a matching objective with an auxiliary NLL term, and its refiners predict a per-pixel covariance (parameterized as a $2\times 2$ precision matrix) to quantify uncertainty. The approach feeds frozen DINOv3 features into a Multi-view Transformer, trains on a diverse mix of wide- and small-baseline datasets, and employs an EMA bias remedy to stabilize subpixel refinement. Empirically, RoMa v2 achieves state-of-the-art accuracy across benchmarks with favorable runtime and memory trade-offs, and the predicted covariance improves downstream pose estimation and RANSAC-based refinement.

Abstract

Dense feature matching aims to estimate all correspondences between two images of a 3D scene and has recently been established as the gold standard due to its high accuracy and robustness. However, existing dense matchers still fail or perform poorly in many hard real-world scenarios, and high-precision models are often slow, limiting their applicability. In this paper, we attack these weaknesses on a wide front through a series of systematic improvements that together yield a significantly better model. In particular, we construct a novel matching architecture and loss, which, combined with a curated diverse training distribution, enables our model to solve many complex matching tasks. We further make training faster through a decoupled two-stage matching-then-refinement pipeline, and at the same time significantly reduce refinement memory usage through a custom CUDA kernel. Finally, we leverage the recent DINOv3 foundation model along with multiple other insights to make the model more robust and unbiased. In our extensive set of experiments we show that the resulting novel matcher sets a new state of the art, being significantly more accurate than its predecessors. Code is available at https://github.com/Parskatt/romav2

Paper Structure

This paper contains 51 sections, 19 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Radar chart of performance on benchmarks. RoMa v2 outperforms previous dense matchers on a wide range of pose estimation and dense matching tasks. Further details on these experiments can be found in the experiments section.
  • Figure 2: Qualitative results. RoMa v2 excels at matching in diverse scenarios. We show a snapshot of results from different benchmarks. Below each image pair we visualize the dense warp by coloring each pixel by the RGB value from its estimated corresponding location in the opposite image. Brighter values mean lower warp confidence as output by the model.
  • Figure 3: Overview of RoMa v2. We estimate bidirectional dense image warps $\mathbf{W} = \{\mathbf{W}^{A\mapsto B}\in \mathbb{R}^{H\times W\times 2}, \mathbf{W}^{B\mapsto A}\in \mathbb{R}^{H\times W\times 2}\}$ and warp confidences $\mathbf{p} = \{\mathbf{p}^{A\mapsto B}\in \mathbb{R}^{H\times W\times 1}, \mathbf{p}^{B\mapsto A}\in \mathbb{R}^{H\times W\times 1}\}$ between two input images using a two-stage pipeline consisting of a matching stage and a refinement stage. Unlike recent SotA dense matchers, we additionally predict a precision matrix $\mathbf{\Sigma}^{-1} = \{(\mathbf{\Sigma}^{-1})^{A\mapsto B}\in \mathbb{R}^{H\times W\times 2 \times 2}, (\mathbf{\Sigma}^{-1})^{B\mapsto A}\in \mathbb{R}^{H\times W\times 2\times 2}\}$. The coarse matcher is a Multi-view Transformer that takes in frozen DINOv3 (Siméoni et al., 2025) foundation model features from images $\mathbf{I}^{A}\in\mathbb{R}^{H\times W \times 3}$ and $\mathbf{I}^{B}\in\mathbb{R}^{H\times W \times 3}$. Its internals are further illustrated in Figure 4 and explained in detail in the matcher section. The refiners are fine-grained UNet-like CNN models that, conditioned on the previous warp and confidence, produce displacements and delta confidences. They additionally predict a full $2\times 2$ precision matrix per pixel, which is visualized as $\left|\mathbf{\Sigma}^{-1}\right|^{-1/4}$. The refiners are further illustrated in Figure 5 and explained in more detail in the refinement section.
  • Figure 4: Coarse matcher. We use a frozen DINOv3 feature extractor in the coarse matching stage. DINOv3 features from both input images are input to a Multi-view Transformer utilizing alternating attention. Dense Prediction Transformer (DPT) (Ranftl et al., 2021) heads output coarse warps $\mathbf{W}$ between the images and confidences $\mathbf{p}$ at 4x downsampled resolution.
  • Figure 5: Refiner internals. The coarse matcher predicts at a resolution 4x smaller than the original image size. The refiners output at the original resolution.
  • ...and 9 more figures
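
The Figure 3 caption states that the per-pixel precision matrix is visualized as $\left|\mathbf{\Sigma}^{-1}\right|^{-1/4}$. The following is a minimal NumPy sketch of that scalar map, not code from the RoMa v2 repository: the function name `precision_to_scalar` and the `(H, W, 2, 2)` array layout are assumptions made for illustration. For an isotropic covariance $\mathbf{\Sigma} = \sigma^2 \mathbf{I}$ the determinant of the precision is $\sigma^{-4}$, so the visualized value reduces to the standard deviation $\sigma$; in general it is the geometric mean of the two principal standard deviations.

```python
import numpy as np


def precision_to_scalar(precision: np.ndarray) -> np.ndarray:
    """Reduce per-pixel 2x2 precision matrices to the scalar |Sigma^{-1}|^{-1/4}.

    precision: array of shape (H, W, 2, 2), assumed symmetric positive definite
               at every pixel (hypothetical layout, chosen for this sketch).
    returns:   array of shape (H, W) holding det(precision) ** (-1/4).
    """
    # np.linalg.det operates on the last two axes and broadcasts over the rest.
    det = np.linalg.det(precision)
    return det ** -0.25


# Usage example on synthetic data (no model output involved).
H, W = 2, 3
sigma = 2.0
isotropic = np.broadcast_to(np.eye(2) / sigma**2, (H, W, 2, 2))
iso_map = precision_to_scalar(isotropic)          # every entry is close to sigma

# Anisotropic case: standard deviations 1 and 4, geometric mean 2.
anisotropic = np.broadcast_to(np.diag([1.0, 1.0 / 16.0]), (H, W, 2, 2))
aniso_map = precision_to_scalar(anisotropic)
```

Collapsing the matrix to a single number discards orientation, so this view shows how uncertain each match is but not the direction in which it is uncertain.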