ViBA: Implicit Bundle Adjustment with Geometric and Temporal Consistency for Robust Visual Matching

Xiaoji Niu, Yuqing Wang, Yan Wang, Hailiang Tang, Tisheng Zhang

Abstract

Most existing image keypoint detection and description methods rely on datasets with accurate pose and depth annotations, which limits scalability and generalization and often degrades navigation and localization performance. We propose ViBA, a sustainable learning framework that integrates geometric optimization with feature learning for continuous online training on unconstrained video streams. Embedded in a standard visual odometry pipeline, it is built around an implicitly differentiable geometric residual framework with three components: (i) an initial tracking network for inter-frame correspondences, (ii) depth-based outlier filtering, and (iii) differentiable global bundle adjustment that jointly refines camera poses and feature positions by minimizing reprojection errors. By combining the geometric consistency enforced by bundle adjustment (BA) with long-term temporal consistency across frames, ViBA learns stable and accurate feature representations. We evaluate ViBA on the EuRoC and UMA datasets. Compared with state-of-the-art methods such as SuperPoint+SuperGlue, ALIKED, and LightGlue, ViBA reduces mean absolute translation error (ATE) by 12-18% and absolute rotation error (ARE) by 5-10% across sequences while maintaining real-time inference speeds (36-91 FPS). When evaluated on unseen sequences, it retains over 90% localization accuracy, demonstrating robust generalization. These results show that ViBA supports continuous online learning with geometric and temporal consistency, consistently improving navigation and localization in real-world scenarios.
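The bundle-adjustment stage described in the abstract jointly refines camera poses and feature positions by minimizing reprojection error. A minimal sketch of the reprojection residual at the core of such an objective is shown below; the intrinsics, pose, and 3D points are illustrative toy values, not quantities taken from the paper.

```python
import numpy as np

def project(K, R, t, X):
    """Project 3D world points X (N, 3) into pixel coordinates
    using intrinsics K (3, 3) and a camera pose (R, t)."""
    Xc = X @ R.T + t            # world frame -> camera frame
    uvw = Xc @ K.T              # apply pinhole intrinsics
    return uvw[:, :2] / uvw[:, 2:3]  # perspective division

def reprojection_residuals(K, R, t, X, obs):
    """Stacked reprojection errors (2N,) between predicted
    projections and observed pixel measurements obs (N, 2)."""
    return (project(K, R, t, X) - obs).ravel()

# Hypothetical toy setup for illustration only.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.zeros(3)
X = np.array([[0.0, 0.0, 4.0],
              [1.0, -1.0, 5.0]])

obs = project(K, R, t, X)   # synthetic "observations" at the true pose
r = reprojection_residuals(K, R, t, X, obs)  # zero at the true state
```

In a full BA solver this residual vector would be fed to a nonlinear least-squares optimizer over all poses and feature depths; in a differentiable BA layer, gradients of the same residual flow back into the tracking network's predicted observations.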

Paper Structure

This paper contains 31 sections, 52 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The pipeline of the proposed method. The snowflake denotes a frozen network, the flame indicates an activated network, and the spark associated with the loss marks the transition from frozen to activated.
  • Figure 2: Architecture of the point matching network. It comprises a shared encoder and a feature matching module. Blocks labeled $1 \times 1$ and $3 \times 3$ represent residual units with the corresponding convolution kernel sizes. $N$ denotes the number of local image patches.
  • Figure 3: Feature tracking and geometric initialization. The tracking network $\theta$ predicts feature observations from input images to form multi-frame feature tracks, which are then used to initialize the geometric state (camera poses and feature depths) for subsequent optimization.
  • Figure 4: Overall training loss construction pipeline integrating differentiable bundle adjustment with multi-frame temporal consistency at both trajectory and descriptor levels.
  • Figure 5: Trade-offs between accuracy and runtime at varying input resolutions.
  • ...and 1 more figure