RUBIK: A Structured Benchmark for Image Matching across Geometric Challenges

Thibaut Loiseau, Guillaume Bourmaud

TL;DR

RUBIK is a novel benchmark that systematically evaluates image matching methods across well-defined geometric difficulty levels using three complementary criteria: overlap, scale ratio, and viewpoint angle. It reveals that while recent detector-free approaches achieve the best performance, they incur significant computational overhead.

Abstract

Camera pose estimation is crucial for many computer vision applications, yet existing benchmarks offer limited insight into method limitations across different geometric challenges. We introduce RUBIK, a novel benchmark that systematically evaluates image matching methods across well-defined geometric difficulty levels. Using three complementary criteria - overlap, scale ratio, and viewpoint angle - we organize 16.5K image pairs from nuScenes into 33 difficulty levels. Our comprehensive evaluation of 14 methods reveals that while recent detector-free approaches achieve the best performance (>47% success rate), they come with significant computational overhead compared to detector-based methods (150-600ms vs. 40-70ms). Even the best performing method succeeds on only 54.8% of the pairs, highlighting substantial room for improvement, particularly in challenging scenarios combining low overlap, large scale differences, and extreme viewpoint changes. Benchmark will be made publicly available.

Paper Structure

This paper contains 21 sections, 6 equations, 27 figures, 3 tables.

Figures (27)

  • Figure 1: We introduce RUBIK -- a benchmark based on images from nuScenes for fine-grained evaluation of camera pose estimation methods. RUBIK consists of image pairs spanning three difficulty criteria: scene overlap, scale ratio, and viewpoint angle difference. It contains 16.5K image pairs across 33 difficulty levels. We use it to provide a comprehensive benchmark of 14 methods.
  • Figure 2: Dense co-visibility map estimation -- Using normal maps ($\textcolor{myblue}{\mathtt{N_1}}$, $\textcolor{myred}{\mathtt{N_2}}$) and depth maps ($\textcolor{myblue}{\mathtt{D_1}}$, $\textcolor{myred}{\mathtt{D_2}}$) along with relative camera poses ($\mathtt{\textcolor{mygrey}{R}_{\textcolor{myblue}{1}\textcolor{myred}{2}}}$, $\mathtt{\textcolor{mygrey}{t}_{\textcolor{myblue}{1}\textcolor{myred}{2}}}$), we warp depth maps between views to obtain $\textcolor{myblue}{\hat{\mathtt{D}}_1}$ and $\textcolor{myred}{\hat{\mathtt{D}}_2}$. Geometric consistency checks classify pixels as co-visible, occluded, or outside the field of view, yielding the co-visibility maps $\textcolor{myblue}{\mathtt{C}_{1\rightarrow 2}}$ and $\textcolor{myred}{\mathtt{C}_{2\rightarrow 1}}$ (see \ref{ssec:covis_gen}). We use UniDepth [piccinelli2024unidepth] for metric depth estimation and Depth Anything V2 [yang2024depth] for normal map computation.
  • Figure 3: Camera pose alignment and filtering -- Visualization of (subsampled) camera trajectories of scene-0266 after aligning COLMAP poses with nuScenes ground truth poses. Blue crosses ($\textcolor{blue}{\times}$) indicate inlier poses (alignment error $<$ 1 m) that are kept for our benchmark, while red crosses ($\textcolor{red}{\times}$) show outlier poses that are discarded.
  • Figure 4: Comparison of surface normal maps -- From left to right: input image (a), normal maps computed from UniDepth's metric depth predictions (b) and from Depth Anything V2 after alignment to UniDepth depth map (c). Note the significantly sharper object boundaries and finer geometric details in Depth Anything V2's prediction, particularly around building edges and depth discontinuities.
  • Figure 5: Two-view setup -- Considering two views, a 3D point can either be co-visible ($\textcolor{mygreen}{\bigstar}$), occluded ($\textcolor{myorange}{\bigstar}$), or outside the field of view ($\textcolor{mygrey}{\bigstar}$) in one of the views. For each co-visible 3D point, we compute its distances $\textcolor{myred}{\mathtt{d_1}}$ and $\textcolor{myblue}{\mathtt{d_2}}$ to both camera centers and the angle $\pmb{\theta}$ between the two lines of sight.
  • ...and 22 more figures
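
The per-point quantities named in the Figure 5 caption (the distances d_1 and d_2 to the two camera centers, and the angle θ between the two lines of sight) can be illustrated with a minimal Python sketch. The function names `point_geometry` and `scale_ratio` are hypothetical, and the symmetric ratio max(d_1/d_2, d_2/d_1) is an assumption made for illustration only: the captions above do not state how RUBIK turns these per-point values into its scale-ratio and viewpoint-angle criteria.

```python
import math


def point_geometry(point, center_1, center_2):
    """Return (d1, d2, theta_deg) for one co-visible 3D point.

    d1, d2: Euclidean distances from the point to each camera center.
    theta_deg: angle in degrees between the two lines of sight.
    All inputs are (x, y, z) tuples expressed in the same world frame.
    """
    # Rays from each camera center to the 3D point.
    ray_1 = [p - c for p, c in zip(point, center_1)]
    ray_2 = [p - c for p, c in zip(point, center_2)]

    d1 = math.sqrt(sum(x * x for x in ray_1))
    d2 = math.sqrt(sum(x * x for x in ray_2))
    if d1 == 0.0 or d2 == 0.0:
        raise ValueError("point coincides with a camera center")

    # Angle between the rays; clamp to guard against rounding error.
    dot = sum(a * b for a, b in zip(ray_1, ray_2))
    cos_theta = max(-1.0, min(1.0, dot / (d1 * d2)))
    return d1, d2, math.degrees(math.acos(cos_theta))


def scale_ratio(d1, d2):
    """Symmetric distance ratio, always >= 1 (illustrative assumption)."""
    return max(d1 / d2, d2 / d1)


# A point one unit in front of camera 1, with camera 2 one unit to the side.
d1, d2, theta = point_geometry((0.0, 0.0, 1.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
print(d1, d2, theta, scale_ratio(d1, d2))
```

In this example the point lies at distance 1 from the first camera center and √2 from the second, and the two lines of sight meet at 45 degrees. A benchmark would still need to aggregate such values over all co-visible points of an image pair; that aggregation is not described in this excerpt and is left out here.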