HyVGGT-VO: Tightly Coupled Hybrid Dense Visual Odometry with Feed-Forward Models

Junxiang Pan, Lipu Zhou, Baojie Chen

Abstract

Dense visual odometry (VO), which provides pose estimation and dense 3D reconstruction, serves as the cornerstone for applications ranging from robotics to augmented reality. Recently, feed-forward models have demonstrated remarkable capabilities in dense mapping. However, when these models are used in dense visual SLAM systems, their heavy computational burden restricts them to producing pose outputs only at sparse keyframes, while still falling short of real-time pose estimation. In contrast, traditional sparse methods offer high computational efficiency and high-frequency pose outputs, but lack the capability for dense reconstruction. To address these limitations, we propose HyVGGT-VO, a novel framework that combines the computational efficiency of sparse VO with the dense reconstruction capabilities of feed-forward models. To the best of our knowledge, this is the first work to tightly couple a traditional VO framework with VGGT, a state-of-the-art feed-forward model. Specifically, we design an adaptive hybrid tracking frontend that dynamically switches between traditional optical flow and the VGGT tracking head to ensure robustness. Furthermore, we introduce a hierarchical optimization framework that jointly refines VO poses and the scale of VGGT predictions to ensure global scale consistency. Our approach achieves an approximately 5x processing speedup over existing VGGT-based methods, while reducing the average trajectory error by 85% on the indoor EuRoC dataset and by 12% on the outdoor KITTI benchmark. Our code will be publicly available upon acceptance. Project page: https://geneta2580.github.io/HyVGGT-VO.io.
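
The adaptive switching in the frontend can be pictured with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a forward-backward KLT survival ratio as the degradation signal, and `vggt_track` is a hypothetical wrapper around the VGGT tracking head, whose real interface is not given in the paper excerpt.

```python
# Minimal sketch of an adaptive hybrid tracking frontend: cheap KLT optical
# flow when tracking is healthy, a feed-forward tracking head when it degrades.
# Thresholds and the quality heuristic are illustrative assumptions.
import cv2
import numpy as np

FB_ERR_THRESH = 1.0       # max forward-backward error (px) for a valid track
MIN_SURVIVAL_RATIO = 0.6  # below this, fall back to the VGGT tracking head

def klt_track(prev_gray, cur_gray, prev_pts):
    """Pyramidal KLT with a forward-backward check.

    prev_pts must be float32 with shape (N, 1, 2), as OpenCV expects.
    """
    cur_pts, st, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, prev_pts, None)
    back_pts, st_b, _ = cv2.calcOpticalFlowPyrLK(cur_gray, prev_gray, cur_pts, None)
    fb_err = np.linalg.norm(prev_pts - back_pts, axis=-1).ravel()
    good = (st.ravel() == 1) & (st_b.ravel() == 1) & (fb_err < FB_ERR_THRESH)
    return cur_pts, good

def vggt_track(prev_img, cur_img, prev_pts):
    """Hypothetical stand-in for the VGGT tracking head (not a real API).

    Assumed to return (tracked points, validity mask) like klt_track.
    """
    raise NotImplementedError("plug in the feed-forward tracking head here")

def hybrid_track(prev_img, cur_img, prev_pts):
    """Use KLT while it is reliable; switch to VGGT under visual degradation."""
    prev_gray = cv2.cvtColor(prev_img, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_img, cv2.COLOR_BGR2GRAY)
    cur_pts, good = klt_track(prev_gray, cur_gray, prev_pts)
    if good.mean() >= MIN_SURVIVAL_RATIO:  # tracking healthy: stay sparse
        return cur_pts, good
    return vggt_track(prev_img, cur_img, prev_pts)  # degraded: go feed-forward
```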

Paper Structure

This paper contains 16 sections, 6 equations, 7 figures, 6 tables.

Figures (7)

  • Figure A1: Motivation and qualitative comparison of dense 3D reconstructions on the EuRoC dataset. Due to its reliance on heavy sub-graph inference, VGGT-SLAM 2.0 [vggt-slam2.0] yields sparse, delayed poses at 3.3 FPS and exhibits severe global scale drift, as seen in the MH01 (b) and V101 (d) sequences. In contrast, our HyVGGT-VO achieves real-time execution at 16.1 FPS, producing globally consistent dense maps for both MH01 (a) and V101 (c) while delivering continuous, high-frequency poses and smoother trajectories.
  • Figure C1: Overall architecture of the proposed HyVGGT-VO framework. Taking a monocular image stream as input, the system features a hybrid tracking frontend and an asynchronous hierarchical optimization backend. The frontend adaptively couples efficient KLT optical flow with a robust VGGT tracking head to handle visual degradation. In the backend, the first stage performs covisibility-based local BA for metric precision, while the second stage executes an asynchronous local PGO incorporating VGGT-predicted relative poses and an explicitly optimized scale factor to ensure consistency.
  • Figure C2: Structure of our local PGO. The first frame of the first sub-graph is fixed as the world reference frame; a minimal sketch of this optimization follows the figure list.
  • Figure D1: Qualitative comparison of dense 3D indoor reconstructions on the EuRoC V101 sequence. Notably, (b) VGGT-SLAM exhibits severe scale drift and structural distortion due to the propagation of $Sim(3)$ alignment errors, while (d) DROID-SLAM yields a highly sparse map caused by its aggressive confidence filtering and down-sampling strategy. In contrast, both (a) MASt3R-SLAM and our proposed method (c) maintain global structural integrity. However, as highlighted in the enlarged detail views, our method preserves significantly richer structural details and superior geometric consistency compared to (a).
  • Figure D2: The reconstructed global map encompasses approximately 20.1 million 3D points over a 920-meter trajectory (visualized after voxel downsampling). As shown in the zoomed-in views, fine details of colorful architectural structures and vegetation are well preserved, demonstrating the capability of our approach to maintain high-fidelity mapping in large-scale outdoor environments.
  • ...and 2 more figures
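
To make the second backend stage concrete, here is a minimal, hypothetical sketch of a pose-graph optimization that jointly refines poses and an explicit scale on VGGT-predicted relative measurements, in the spirit of Figures C1 and C2. For brevity it optimizes translations only; `vo_edges`, `vggt_edges`, and the log-scale parameterization are assumptions for illustration, not the paper's actual formulation.

```python
# Sketch of a local PGO that jointly refines poses and the scale of
# VGGT-predicted relative measurements. The first pose is held fixed as the
# world reference, matching the convention described for Figure C2.
# Rotations and information weights are omitted for brevity.
import numpy as np
from scipy.optimize import least_squares

def pgo_with_scale(t_init, vo_edges, vggt_edges):
    """t_init: (N, 3) initial translations; pose 0 is held fixed.
    vo_edges:   list of (i, j, t_ij) metric VO relative translations.
    vggt_edges: list of (i, j, t_ij) VGGT relative translations (unknown scale).
    Returns optimized translations and the recovered VGGT scale factor.
    """
    n = len(t_init)

    def unpack(x):
        t = np.vstack([t_init[0], x[:-1].reshape(n - 1, 3)])  # pose 0 fixed
        s = np.exp(x[-1])  # log-scale keeps the factor strictly positive
        return t, s

    def residuals(x):
        t, s = unpack(x)
        res = []
        for i, j, t_ij in vo_edges:    # metric constraints from VO
            res.append((t[j] - t[i]) - t_ij)
        for i, j, t_ij in vggt_edges:  # scaled constraints from VGGT
            res.append((t[j] - t[i]) - s * t_ij)
        return np.concatenate(res)

    x0 = np.concatenate([t_init[1:].ravel(), [0.0]])  # log-scale starts at 1
    sol = least_squares(residuals, x0)
    return unpack(sol.x)
```

Parameterizing the scale as exp(x) keeps it strictly positive during unconstrained least squares, which is one natural way to realize an "explicitly optimized scale factor" alongside the pose variables.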