To Glue or Not to Glue? Classical vs Learned Image Matching for Mobile Mapping Cameras to Textured Semantic 3D Building Models

Simone Gaisbauer, Prabin Gyawali, Qilin Zhang, Olaf Wysocki, Boris Jutzi

TL;DR

This work systematically compares classical handcrafted and modern learnable feature matching methods for matching mobile mapping camera images to textured semantic 3D building models. It introduces a texture-based pose-estimation framework and evaluates the methods on standard benchmarks (HPatches, MegaDepth-1500) and a custom TUM2TWIN-derived dataset, using PnP with RANSAC and ground-truth poses for quantitative assessment. Results show that learnable methods generally outperform traditional approaches on the challenging custom data, where results range from zero to 12 RANSAC inliers and the AUC reaches at most 0.16, while classical methods can still be competitive on generic benchmarks. The findings support the adoption of model-based localization with textured semantic models. The authors release open-source code for further development and validation, and highlight the potential for landmark-based positioning and future cross-model comparison work.
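
To make the role of the texture in pose estimation more concrete, the following is a minimal sketch of how a matched texture pixel could be lifted to a 3D point and checked against a candidate camera pose. It is not the authors' implementation: the planar rectangular facade, the function names (`texture_pixel_to_3d`, `project`, `count_inliers`), the pinhole intrinsics, and all example numbers are assumptions made for illustration. Only the 30 px threshold is taken from the caption of Figure 1. The pose itself would come from a PnP solver inside a RANSAC loop, which is not shown here.

```python
import math

def texture_pixel_to_3d(u, v, tex_w, tex_h, origin, edge_u, edge_v):
    """Map texture pixel (u, v) to a 3D point on a planar facade.

    Assumption: the texture covers a rectangle spanned by edge_u and edge_v,
    starting at the 3D corner `origin` that corresponds to pixel (0, 0).
    """
    a, b = u / tex_w, v / tex_h
    return tuple(o + a * eu + b * ev for o, eu, ev in zip(origin, edge_u, edge_v))

def project(point, R, t, fx, fy, cx, cy):
    """Pinhole projection of a world point with rotation R (3x3) and translation t."""
    xc, yc, zc = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    if zc <= 0:  # point lies behind the camera
        return None
    return (fx * xc / zc + cx, fy * yc / zc + cy)

def count_inliers(matches, facade, R, t, intrinsics, threshold_px=30.0):
    """Count matches whose reprojection error is below threshold_px."""
    inliers = 0
    for (u, v), (x_obs, y_obs) in matches:
        point = texture_pixel_to_3d(u, v, *facade)
        projected = project(point, R, t, *intrinsics)
        if projected is None:
            continue
        if math.hypot(projected[0] - x_obs, projected[1] - y_obs) < threshold_px:
            inliers += 1
    return inliers

# Hypothetical facade: 1000 x 500 px texture on a 10 m x 5 m wall, 20 m in front of the camera.
facade = (1000, 500, (-5.0, -2.5, 20.0), (10.0, 0.0, 0.0), (0.0, 5.0, 0.0))
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # candidate pose: identity
t = [0.0, 0.0, 0.0]
intrinsics = (1000.0, 1000.0, 960.0, 540.0)  # fx, fy, cx, cy (assumed values)

# Each match pairs a texture pixel with a camera-image pixel; the last one is a gross outlier.
matches = [((500, 250), (962.0, 541.0)),
           ((0, 0), (710.0, 415.0)),
           ((1000, 500), (400.0, 100.0))]
n_inliers = count_inliers(matches, facade, R, t, intrinsics)
print(n_inliers)
```

In this toy setup, two of the three matches fall within the threshold. With real data, the 3D coordinates would be derived from the building model's geometry and texture mapping rather than from a hand-specified rectangle.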

Abstract

Feature matching is a necessary step for many computer vision and photogrammetry applications such as image registration, structure-from-motion, and visual localization. Classical handcrafted methods such as SIFT feature detection and description combined with nearest neighbour matching and RANSAC outlier removal have been state-of-the-art for mobile mapping cameras. With recent advances in deep learning, learnable methods have been introduced and proven to have better robustness and performance under complex conditions. Despite their growing adoption, a comprehensive comparison between classical and learnable feature matching methods for the specific task of semantic 3D building camera-to-model matching is still missing. This submission systematically evaluates the effectiveness of different feature-matching techniques in visual localization using textured CityGML LoD2 models. We use standard benchmark datasets (HPatches, MegaDepth-1500) and custom datasets consisting of facade textures and corresponding camera images (terrestrial and drone). For the latter, we evaluate the achievable accuracy of the absolute pose estimated using a Perspective-n-Point (PnP) algorithm, with geometric ground truth derived from geo-referenced trajectory data. The results indicate that the learnable feature matching methods vastly outperform traditional approaches regarding accuracy and robustness on our challenging custom datasets with zero to 12 RANSAC inliers and zero to 0.16 area under the curve. We believe that this work will foster the development of model-based visual localization methods. Link to the code: https://github.com/simBauer/To_Glue_or_not_to_Glue
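
The abstract reports an area under the curve (AUC) but does not spell out how it is computed. A common convention in pose-estimation benchmarks is the area under the cumulative pose-error curve up to a threshold, normalized by that threshold so the value lies between 0 and 1. The sketch below follows that convention as an assumption; the function name `pose_auc`, the threshold, and the example errors are hypothetical and not taken from the paper.

```python
import math

def pose_auc(errors, threshold):
    """Normalized area under the cumulative error curve up to `threshold`.

    `errors` holds one pose error per image pair; failed estimates can be
    passed as math.inf. Returns a value in [0, 1], where 1 means every
    pair has zero error.
    """
    n = len(errors)
    if n == 0:
        return 0.0
    xs, ys = [0.0], [0.0]  # curve starts at zero error, zero recall
    for i, err in enumerate(sorted(errors)):
        if err >= threshold:
            break
        xs.append(err)
        ys.append((i + 1) / n)  # fraction of pairs with error <= err
    xs.append(threshold)  # extend the curve flat to the threshold
    ys.append(ys[-1])
    # Trapezoidal integration of recall over error.
    area = sum((xs[k + 1] - xs[k]) * (ys[k + 1] + ys[k]) / 2.0
               for k in range(len(xs) - 1))
    return area / threshold

# Hypothetical errors: two good poses, one beyond the threshold.
example = pose_auc([1.0, 3.0, 20.0], threshold=10.0)
print(round(example, 4))
```

Under this convention, an AUC of 0.16 would indicate that only a small share of image pairs yields a pose within the error threshold, which is consistent with the abstract describing the custom datasets as challenging.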

Paper Structure

This paper contains 25 sections, 7 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Classical (top) vs learnable (bottom) feature matching methods on a 3D building model's (left) texture image (middle, red rectangle) to mobile mapping image (right) with green lines connecting inlier matches (<30 px projection error).
  • Figure 2: Camera-to-textured-model image matching overview.
  • Figure 3: Exemplary textured semantic 3D building model.
  • Figure 4: Sketch of the texturing, coordinate systems and coordinate conversion.
  • Figure 5: All the camera images that contain one facade element are referenced as pairs.
  • ...and 4 more figures