Semantic Aware Feature Extraction for Enhanced 3D Reconstruction

Ronald Nap, Andy Xiao

Abstract

Feature matching is a fundamental problem in computer vision with wide-ranging applications, including simultaneous localization and mapping (SLAM), image stitching, and 3D reconstruction. While recent advances in deep learning have improved keypoint detection and description, most approaches focus primarily on geometric attributes and often neglect higher-level semantic information. This work proposes a semantic-aware feature extraction framework that employs multi-task learning to jointly train keypoint detection, keypoint description, and semantic segmentation. The method is benchmarked against standard feature matching techniques and evaluated in the context of 3D reconstruction. To enhance feature correspondence, a deep matching module is integrated. The system is tested using input from a single monocular fisheye camera mounted on a vehicle and evaluated within a multi-floor parking structure. The proposed approach supports semantic 3D reconstruction with altitude estimation, capturing elevation changes and enabling multi-level mapping. Experimental results demonstrate that the method produces semantically annotated 3D point clouds with improved structural detail and elevation information, underscoring the effectiveness of joint training with semantic cues for more consistent feature matching and enhanced 3D reconstruction.

Paper Structure (14 sections, 9 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Feature correspondences predicted by our method (top row), SuperGlue (middle row), and SIFT (bottom row). Correct matches are shown in green, incorrect matches in red. A sequential feature matching demo video can be found at https://streamable.com/8vhwcs.
  • Figure 2: Visualization of our semantic 3D reconstruction. Sequential feature matches and semantic masks are integrated into COLMAP, resulting in a colorized 3D reconstruction. A demonstration video is available at https://streamable.com/jv66z5.
  • Figure 3: Top: Rotation. Bottom: Homography.
  • Figure 4: Estimated trajectory vs. ground truth.
  • Figure (architecture overview, labeled "Figure 1" in its own caption): We use a single shared encoder to extract features, which pass through an initial decoding stage (Decoder 1) followed by a cross-task mixer module that promotes information exchange between tasks. The refined features then undergo a second decoding stage (Decoder 2) to produce segmentation masks, semantic descriptors, and keypoint detections.
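
The data flow in the architecture caption above (shared encoder, Decoder 1, cross-task mixer, Decoder 2, three outputs) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the class name `MultiTaskSketch`, the layer widths, the use of per-pixel linear layers in place of convolutions, and the mixer design (concatenate the task features, project, add back to each branch) are all assumptions made for this example, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
TASKS = ("segmentation", "descriptor", "keypoint")


def linear(in_ch, out_ch):
    """Random weights for a per-pixel linear layer (stand-in for a 1x1 convolution)."""
    return rng.standard_normal((in_ch, out_ch)) * 0.1


def apply(x, w):
    """Per-pixel linear layer followed by ReLU. x has shape (H, W, C_in)."""
    return np.maximum(x @ w, 0.0)


class MultiTaskSketch:
    """Hypothetical stand-in for the shared-encoder, two-stage-decoder layout."""

    def __init__(self, in_ch=3, feat=16, n_classes=5, desc_dim=32):
        # All sizes here are assumptions chosen for illustration.
        self.enc = linear(in_ch, feat)                        # shared encoder
        self.dec1 = {t: linear(feat, feat) for t in TASKS}    # Decoder 1, one branch per task
        self.mix = linear(feat * len(TASKS), feat)            # cross-task mixer (assumed design)
        out_ch = {"segmentation": n_classes, "descriptor": desc_dim, "keypoint": 1}
        self.dec2 = {t: linear(feat, out_ch[t]) for t in TASKS}  # Decoder 2 heads

    def forward(self, image):
        shared = apply(image, self.enc)
        stage1 = {t: apply(shared, self.dec1[t]) for t in TASKS}
        # Mixer: concatenate all task features, project, and add the result
        # back to every branch so each task sees the other tasks' features.
        stacked = np.concatenate([stage1[t] for t in TASKS], axis=-1)
        mixed = apply(stacked, self.mix)
        refined = {t: stage1[t] + mixed for t in TASKS}
        # Decoder 2: one output head per task.
        seg_logits = refined["segmentation"] @ self.dec2["segmentation"]
        desc = refined["descriptor"] @ self.dec2["descriptor"]
        desc = desc / (np.linalg.norm(desc, axis=-1, keepdims=True) + 1e-8)
        kp_logit = refined["keypoint"] @ self.dec2["keypoint"]
        kp_score = 1.0 / (1.0 + np.exp(-kp_logit))
        return seg_logits.argmax(axis=-1), desc, kp_score[..., 0]


model = MultiTaskSketch()
mask, desc, score = model.forward(np.ones((8, 10, 3)))
print(mask.shape, desc.shape, score.shape)
```

The point of the sketch is the wiring: every task branch is computed from the same encoder output, and the mixer is the only place where the branches exchange information before the final heads. In a trained system the three outputs would typically be supervised jointly, which is what the abstract refers to as multi-task learning.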