SPIDER: Spatial Image CorresponDence Estimator for Robust Calibration
Zhimin Shao, Abhay Yadav, Rama Chellappa, Cheng Peng
TL;DR
SPIDER addresses universal image matching under unconstrained conditions by unifying a 3D Vision Foundation Model backbone with a high-resolution 2D encoder and dual coarse-to-fine heads for geometry-based and pattern-based matching. The descriptor head yields geometry-aware dense descriptors while the warp head regresses dense pixel correspondences, with their fusion enabling robust matches across wide baselines. Trained on ten diverse datasets and evaluated on multiple benchmarks, SPIDER achieves state-of-the-art performance in two-view geometry and unconstrained scenarios, including aerial-to-ground matching, and reduces planar bias via multi-scale refinement. This work bridges 2D and 3D feature representations to enable robust camera calibration and pose estimation in challenging environments.
Abstract
Reliable image correspondences form the foundation of vision-based spatial perception, enabling recovery of 3D structure and camera poses. However, unconstrained feature matching across domains such as aerial, indoor, and outdoor scenes remains challenging due to large variations in appearance, scale, and viewpoint. Feature matching has conventionally been formulated as a 2D-to-2D problem; however, recent 3D foundation models provide spatial feature matching properties based on two-view geometry. While these spatially coherent matches are powerful, we observe that they often concentrate on dominant planar regions, e.g., walls or ground surfaces, and are less sensitive to fine-grained geometric details, particularly under large viewpoint changes. To better understand these trade-offs, we first perform linear probe experiments to evaluate the performance of various vision foundation models for image matching. Building on these insights, we introduce SPIDER, a universal feature matching framework that integrates a shared feature extraction backbone with two specialized network heads for estimating both 2D-based and 3D-based correspondences from coarse to fine. Finally, we introduce an image-matching evaluation benchmark that focuses on unconstrained scenarios with large baselines. SPIDER significantly outperforms state-of-the-art (SoTA) methods, demonstrating its strong ability as a universal image-matching method.
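As a rough illustration of how dense descriptors can be turned into discrete correspondences, the sketch below applies mutual nearest-neighbor matching under cosine similarity. This is a common generic technique and an assumption on our part; the text above does not state how SPIDER extracts matches from its descriptor head or how it fuses them with the warp head. The function name `mutual_nearest_neighbors` and the toy inputs are hypothetical.

```python
import numpy as np


def mutual_nearest_neighbors(desc_a, desc_b):
    """Return index pairs (i, j) such that desc_a[i] and desc_b[j] are each
    other's nearest neighbor under cosine similarity.

    desc_a: (N, D) array of descriptors from image A (hypothetical input).
    desc_b: (M, D) array of descriptors from image B (hypothetical input).
    """
    # L2-normalize so that the dot product equals cosine similarity.
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)

    sim = a @ b.T                 # (N, M) similarity matrix
    nn_ab = sim.argmax(axis=1)    # best match in B for each descriptor in A
    nn_ba = sim.argmax(axis=0)    # best match in A for each descriptor in B

    # Keep a pair only if the match is consistent in both directions.
    idx_a = np.arange(a.shape[0])
    keep = nn_ba[nn_ab] == idx_a
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)


if __name__ == "__main__":
    # Toy example: random descriptors and a shuffled, slightly noisy copy.
    rng = np.random.default_rng(0)
    desc_a = rng.normal(size=(5, 16))
    perm = rng.permutation(5)
    desc_b = desc_a[perm] + 0.01 * rng.normal(size=(5, 16))
    print(mutual_nearest_neighbors(desc_a, desc_b))
```

The mutual check discards one-directional matches, which are a frequent source of outliers under wide baselines; a real pipeline would typically add a confidence threshold and a geometric verification step on top.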
