SuperPoint-E: local features for 3D reconstruction via tracking adaptation in endoscopy

O. Leon Barbed, José M. M. Montiel, Pascal Fua, Ana C. Murillo

TL;DR

This work introduces SuperPoint-E (SP-E), an endoscopy-tailored local feature extractor trained with Tracking Adaptation, a supervision strategy derived from Structure-from-Motion (SfM) reconstructions. By using reliable tracks from COLMAP as ground-truth correspondences, SP-E learns detectors and descriptors that are more repeatable and discriminative in endoscopic imagery, which improves the density and coverage of downstream SfM. The approach outperforms SIFT and the original SuperPoint across multiple endoscopic datasets; SP-E is robust to specularities and reduces the need for guided matching. These advances enable denser 3D reconstructions and longer sequence coverage, and could potentially extend to SLAM and mixed-reality applications in endoscopy and related modalities.
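To make the supervision source concrete, the sketch below reprojects reconstructed 3D points into one frame, which is the first step of turning a COLMAP reconstruction into per-frame correspondences. It is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a plain pinhole camera with no lens distortion (real endoscopes usually need a distortion model), uses the world-to-camera convention x_cam = R·X + t that COLMAP uses for image poses, and the function names `project_point` and `reproject_all` are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]
Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), assumed pinhole


def project_point(X: Vec3, R: Mat3, t: Vec3, K: Intrinsics,
                  size: Tuple[int, int]) -> Optional[Tuple[float, float]]:
    """Project world point X into a camera with pose (R, t).

    Returns pixel coordinates (u, v), or None if the point lies behind the
    camera or outside the image. Lens distortion is deliberately ignored.
    """
    fx, fy, cx, cy = K
    width, height = size
    # World -> camera coordinates: x_cam = R @ X + t.
    xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        return None  # behind the camera
    # Perspective division followed by the intrinsic parameters.
    u = fx * xc[0] / xc[2] + cx
    v = fy * xc[1] / xc[2] + cy
    if not (0 <= u < width and 0 <= v < height):
        return None  # projects outside the image
    return (u, v)


def reproject_all(points: Dict[int, Vec3], R: Mat3, t: Vec3, K: Intrinsics,
                  size: Tuple[int, int]) -> Dict[int, Tuple[float, float]]:
    """Return {point_id: (u, v)} for every 3D point visible in this frame."""
    visible = {}
    for pid, X in points.items():
        uv = project_point(X, R, t, K, size)
        if uv is not None:
            visible[pid] = uv
    return visible
```

Repeating `reproject_all` over every registered frame yields, for each 3D point, its position along the whole sequence, i.e. the complete point track that the reliable-track selection then trims.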

Abstract

In this work, we focus on improving feature extraction to boost the performance of Structure-from-Motion (SfM) in endoscopy videos. We present SuperPoint-E, a new local feature extraction method that, using our proposed Tracking Adaptation supervision strategy, significantly improves the quality of feature detection and description in endoscopy. Extensive experiments on real endoscopy recordings identify the most suitable configuration of our approach and evaluate the quality of SuperPoint-E features. The comparison with other baselines also shows that our 3D reconstructions are denser and cover more and longer video segments, because our detector fires more densely and our features are more likely to survive (i.e., higher detection precision). In addition, our descriptor is more discriminative, making the guided matching step almost redundant. Compared with the original SuperPoint and the gold-standard COLMAP SfM pipeline, the presented approach brings significant improvements in the 3D reconstructions obtained via SfM on endoscopy videos.
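A more discriminative descriptor matters because it lets plain descriptor matching, without a geometry-guided step, produce usable correspondences. As a hedged illustration of what such unguided matching can look like, the sketch below implements mutual nearest-neighbour matching on L2-normalised descriptors using cosine similarity. This is a generic technique chosen for illustration, not necessarily the matcher used in the paper, and `mutual_nn_matches` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(d: Sequence[float]) -> List[float]:
    """Scale a descriptor to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in d)) or 1.0
    return [x / n for x in d]


def mutual_nn_matches(desc_a: Sequence[Sequence[float]],
                      desc_b: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Match two descriptor sets by mutual nearest neighbour (cosine similarity).

    A pair (i, j) is kept only if j is i's most similar descriptor in B AND
    i is j's most similar descriptor in A. No epipolar or other geometric
    verification is applied.
    """
    a = [l2_normalize(d) for d in desc_a]
    b = [l2_normalize(d) for d in desc_b]
    if not a or not b:
        return []
    # sim[i][j] = <a_i, b_j>, the cosine similarity of unit vectors.
    sim = [[sum(x * y for x, y in zip(ai, bj)) for bj in b] for ai in a]
    best_b = [max(range(len(b)), key=lambda j: sim[i][j]) for i in range(len(a))]
    best_a = [max(range(len(a)), key=lambda i: sim[i][j]) for j in range(len(b))]
    # Keep only pairs that agree in both directions.
    return [(i, j) for i, j in enumerate(best_b) if best_a[j] == i]
```

For example, `mutual_nn_matches([[1, 0], [0, 1], [1, 1]], [[0, 1], [1, 0.1]])` keeps `(0, 1)` and `(1, 0)` and drops the third descriptor, whose best match in the second set prefers another descriptor. The mutual check is a cheap filter; the more distinctive the descriptors, the fewer wrong pairs survive it without further geometric checks.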

Paper Structure (15 sections, 3 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Tracking Adaptation overview. We run COLMAP 3D reconstructions and reproject the 3D points onto the sequence frames. Resulting reliable tracks are used as supervision to train SuperPoint-E. A new tracking loss leverages this supervision to refine the feature descriptors.
  • Figure 2: Supervision points obtained from a COLMAP reconstruction. (a) All 3D points are reprojected into each video frame. Green points were originally detected in this frame, while blue points were not. (b-d) Detail of a complete point track (all positions of one 3D point along the sequence). The reliable track for this point is the green segment. The track starts when a point is first detected (b). When the feature is not detected anymore (d), it is depicted in blue from then on and is no longer part of the reliable track.
  • Figure 3: Matching examples for pairs of frames 0.1, 0.2, and 0.5 seconds apart, within two sequences of EM-Test.
  • Figure 4: Point clouds and camera trajectories reconstructed for three subsequences. Within each block, Left: original frames aligned with reconstructed points. The top row shows eight evenly spaced input frames; the middle and bottom rows show the corresponding views of the point cloud from the reconstructed cameras, using SIFT+GM and SuperPoint-E+GM, respectively. Right: SIFT and SuperPoint-E 3D reconstructions, including the camera poses (red markers).
  • Figure 5: Coverage and reconstructed sections of full endoscopy videos based on SIFT or SP-E. Bar: full video timeline, left to right. Colored blocks: frames included in the reconstructions (the larger the colored parts, the more frames COLMAP is able to reconstruct given the chosen features). Same color means same reconstruction (or "model").
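The reliable-track rule described in the Figure 2 caption (the track starts at the first detection and ends as soon as the feature is no longer detected, with all later frames excluded) can be sketched as follows. This is a minimal sketch, assuming the per-frame detection flags for one 3D point are already known, for instance from whether the frame holds one of that point's COLMAP observations; the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def reliable_track(detected: Sequence[bool]) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, of the reliable track.

    detected[k] is True when the point's reprojection in frame k corresponds
    to a feature that was actually detected there (green in Figure 2).
    Returns None if the point is never detected.
    """
    # The track starts at the first frame where the point is detected.
    start = next((k for k, d in enumerate(detected) if d), None)
    if start is None:
        return None
    # It continues only while the point keeps being detected; the first miss
    # ends it, and later frames are discarded even if detected again.
    end = start
    while end < len(detected) and detected[end]:
        end += 1
    return (start, end)


def supervision_frames(detected: Sequence[bool]) -> List[int]:
    """Frame indices whose reprojections would serve as supervision points."""
    span = reliable_track(detected)
    return [] if span is None else list(range(*span))
```

For a point detected in frames 1, 2 and 4 of a five-frame sequence, `reliable_track([False, True, True, False, True])` gives `(1, 3)`: frame 4 is dropped because the chain of detections was already broken at frame 3. Cutting at the first miss keeps only the stretch where the feature was demonstrably trackable, which is the intent of using reliable rather than complete tracks.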