SPIDER: Spatial Image CorresponDence Estimator for Robust Calibration

Zhimin Shao, Abhay Yadav, Rama Chellappa, Cheng Peng

TL;DR

SPIDER addresses universal image matching under unconstrained conditions by unifying a 3D Vision Foundation Model backbone with a high-resolution 2D encoder and dual coarse-to-fine heads for geometry-based and pattern-based matching. The descriptor head yields geometry-aware dense descriptors while the warp head regresses dense pixel correspondences, with their fusion enabling robust matches across wide baselines. Trained on ten diverse datasets and evaluated on multiple benchmarks, SPIDER achieves state-of-the-art performance in two-view geometry and unconstrained scenarios, including aerial-to-ground matching, and reduces planar bias via multi-scale refinement. This work bridges 2D and 3D feature representations to enable robust camera calibration and pose estimation in challenging environments.

Abstract

Reliable image correspondences form the foundation of vision-based spatial perception, enabling recovery of 3D structure and camera poses. However, unconstrained feature matching across domains such as aerial, indoor, and outdoor scenes remains challenging due to large variations in appearance, scale, and viewpoint. Feature matching has been conventionally formulated as a 2D-to-2D problem; however, recent 3D foundation models provide spatial feature matching properties based on two-view geometry. While powerful, we observe that these spatially coherent matches often concentrate on dominant planar regions, e.g., walls or ground surfaces, while being less sensitive to fine-grained geometric details, particularly under large viewpoint changes. To better understand these trade-offs, we first perform linear probe experiments to evaluate the performance of various vision foundation models for image matching. Building on these insights, we introduce SPIDER, a universal feature matching framework that integrates a shared feature extraction backbone with two specialized network heads for estimating both 2D-based and 3D-based correspondences from coarse to fine. Finally, we introduce an image-matching evaluation benchmark that focuses on unconstrained scenarios with large baselines. SPIDER significantly outperforms SoTA methods, demonstrating its strong ability as a universal image-matching method.

Paper Structure

This paper contains 20 sections, 17 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Visualized in 3D, SPIDER jointly predicts pixel-wise warps and feature descriptors even across large viewpoint changes, unifying the appearance sensitivity of RoMa and the geometric consistency of Aerial-MASt3R within a single framework. This enables accurate camera calibration and pose estimation, matching or surpassing state-of-the-art performance on challenging benchmarks.
  • Figure 2: Method Overview. Given two input images $I^A$ and $I^B$, our method builds on 3D VFM features and ConvNet features to combine semantic alignment and geometric consistency. A dual-head architecture operates in a coarse-to-fine manner: (1) the descriptor head aggregates multi-scale features through attention-based Fusion Gates to produce geometry-aware descriptors and confidence maps; (2) the warp head predicts dense correspondence fields and confidence maps, progressively refined across multiple scales. Final correspondences are sampled from the predicted warp and fastNN.
  • Figure 3: Visualization of feature descriptors from Aerial-MASt3R and SPIDER, based on images from the teaser figure. By introducing multi-scale feature upsampling, SPIDER obtains significantly better-resolved features and thus more accurate correspondences.
  • Figure 4: Visualization of camera positions for four multi-elevation scenes collected in unconstrained scenarios.
  • Figure 5: Visual Comparison under unconstrained settings. Image pattern-driven methods, e.g., RoMa [rombach2022high], find diverse matches across many planes; however, matches may be false negatives on two sides of the building. Geometry-driven methods [leroy2024grounding, vuong2025aerialmegadepth] are better at matching planes. This can cause the matches to collapse onto a single homography if a confident plane dominates, e.g., when Aerial-MASt3R matches the wrong signs in Urban with high confidence. SPIDER combines both approaches and produces diverse and accurate matches.
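
The Figure 2 caption states that final correspondences are sampled from the predicted warp and from fastNN. The snippet below is a minimal sketch of that idea, not the authors' implementation. It assumes that fastNN behaves like a mutual (reciprocal) nearest-neighbour search over descriptors, that pixels are addressed by flattened integer indices, and that the two match sets are merged by a simple union. The function names, the `min_conf` threshold, and the toy descriptors are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Match = Tuple[int, int]  # (pixel index in image A, pixel index in image B)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def mutual_nearest_neighbors(desc_a, desc_b) -> List[Match]:
    """Keep (i, j) only if j is the best match of i AND i is the best match of j."""
    if not desc_a or not desc_b:
        return []
    sim = [[cosine(a, b) for b in desc_b] for a in desc_a]
    # Best column (B index) for every row (A index).
    best_b = [max(range(len(desc_b)), key=row.__getitem__) for row in sim]
    # Best row (A index) for every column (B index).
    best_a = [max(range(len(desc_a)), key=lambda i: sim[i][j])
              for j in range(len(desc_b))]
    return [(i, j) for i, j in enumerate(best_b) if best_a[j] == i]


def matches_from_warp(warp: Sequence[int], conf: Sequence[float],
                      min_conf: float) -> List[Match]:
    """Keep warp predictions whose confidence reaches min_conf (assumed threshold)."""
    return [(i, j) for i, (j, c) in enumerate(zip(warp, conf)) if c >= min_conf]


def fuse(nn_matches: Sequence[Match], warp_matches: Sequence[Match]) -> List[Match]:
    """Union of both match sets with duplicates removed and order preserved."""
    return list(dict.fromkeys(list(nn_matches) + list(warp_matches)))


# Toy data: four descriptors in image A, three in image B.
desc_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.9, 0.1]]
desc_b = [[0.0, 2.0], [3.0, 0.1], [1.0, 1.0]]

nn = mutual_nearest_neighbors(desc_a, desc_b)
warp = matches_from_warp(warp=[1, 2, 2, 1], conf=[0.9, 0.2, 0.8, 0.7], min_conf=0.5)
fused = fuse(nn, warp)
print(fused)  # [(0, 1), (1, 0), (2, 2), (3, 1)]
```

In this toy case, pixel 3 of image A fails the reciprocity test because pixel 0 is a better match for the same target, so the descriptor branch drops it, while the warp branch still contributes it with confidence 0.7. This illustrates how two complementary match sources can yield more diverse correspondences than either alone.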