
TrajVG: 3D Trajectory-Coupled Visual Geometry Learning

Xingyu Miao, Weiguang Zhao, Tao Lu, Linning Yu, Mulin Yu, Yang Long, Jiangmiao Pang, Junting Dong

TL;DR

TrajVG tackles motion-induced degradation in multi-frame 3D reconstruction by predicting camera-coordinate trajectories that explicitly link per-frame point maps with relative poses. It introduces two geometric constraints—bidirectional trajectory-pointmap consistency and anchor-based pose consistency with static points—and extends to self-supervised training using pseudo 2D tracks for in-the-wild data. The approach achieves state-of-the-art or competitive results across 3D tracking, camera pose estimation, point map reconstruction, and video depth, validated through extensive experiments and ablations. By enabling mixed supervision and robust cross-frame fusion, TrajVG offers a scalable solution for dynamic scenes with practical impact on AR, robotics, and navigation.

Abstract

Feed-forward multi-frame 3D reconstruction models often degrade on videos with object motion. A global reference frame becomes ambiguous under multiple motions, while local point maps rely heavily on estimated relative poses and can drift, causing cross-frame misalignment and duplicated structures. We propose TrajVG, a reconstruction framework that makes cross-frame 3D correspondence an explicit prediction by estimating camera-coordinate 3D trajectories. We couple sparse trajectories, per-frame local point maps, and relative camera poses with geometric consistency objectives: (i) bidirectional trajectory-pointmap consistency with controlled gradient flow, and (ii) a pose consistency objective driven by static track anchors that suppresses gradients from dynamic regions. To scale training to in-the-wild videos where 3D trajectory labels are scarce, we reformulate the same coupling constraints into self-supervised objectives using only pseudo 2D tracks, enabling unified training with mixed supervision. Extensive experiments across 3D tracking, pose estimation, point map reconstruction, and video depth show that TrajVG outperforms current feed-forward baselines.
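
The two coupling constraints named in the abstract can be illustrated with a small NumPy sketch. This is not the paper's loss: the function names, the L2 residual, the nearest-pixel lookup, and the boolean static mask are all assumptions made for illustration, and the "controlled gradient flow" (stop-gradient) aspect is omitted because plain NumPy has no autograd.

```python
import numpy as np

def pose_consistency_residual(traj_i, traj_j, R_ij, t_ij, static_mask):
    """Mean of ||X_j - (R_ij X_i + t_ij)|| over static anchors only.

    traj_i, traj_j: (N, 3) camera-coordinate track points in frames i and j.
    R_ij, t_ij: hypothetical relative pose mapping frame-i to frame-j coordinates.
    static_mask: (N,) boolean; dynamic points are excluded from the residual.
    """
    pred_j = traj_i @ R_ij.T + t_ij
    err = np.linalg.norm(pred_j - traj_j, axis=-1)
    w = static_mask.astype(float)
    return float((w * err).sum() / max(w.sum(), 1.0))

def trajectory_pointmap_residual(traj, pointmap, pixels):
    """Mean distance between track points and the point map at their pixels.

    traj: (N, 3) track points; pointmap: (H, W, 3); pixels: (N, 2) integer (u, v).
    """
    u, v = pixels[:, 0], pixels[:, 1]
    sampled = pointmap[v, u]  # nearest-pixel lookup (assumed; no interpolation)
    return float(np.linalg.norm(sampled - traj, axis=-1).mean())

# Toy scene: five tracked points, one of which moves independently of the camera.
rng = np.random.default_rng(0)
X_i = rng.uniform(-1.0, 1.0, size=(5, 3)) + np.array([0.0, 0.0, 3.0])
theta = 0.1
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.2, 0.0, -0.1])
X_j = X_i @ R.T + t
X_j[0] += np.array([0.5, 0.0, 0.0])  # point 0 is dynamic
static = np.array([False, True, True, True, True])

masked = pose_consistency_residual(X_i, X_j, R, t, static)
unmasked = pose_consistency_residual(X_i, X_j, R, t, np.ones(5, dtype=bool))

# Point map whose values agree with the tracks at the queried pixels.
pointmap = rng.uniform(0.0, 5.0, size=(4, 6, 3))
pixels = np.array([[1, 2], [5, 0]])
tracks = pointmap[pixels[:, 1], pixels[:, 0]]
pm_res = trajectory_pointmap_residual(tracks, pointmap, pixels)
```

With the dynamic point masked out, the pose residual vanishes under the true relative pose; including it biases the residual, which is the motivation for restricting the pose objective to static anchors.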

Paper Structure

This paper contains 26 sections, 9 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Visualization results of TrajVG on in-the-wild videos.
  • Figure 2: Overview of our model. We jointly predict 3D point tracks together with point maps and camera poses, so that tracking directly improves both geometry reconstruction and camera motion estimation.
  • Figure 3: Qualitative comparison of multi-view 3D reconstruction. Our method achieves better reconstruction results in field scenarios. In contrast, baseline methods suffer from issues such as overlap and loss of detail.
  • Figure 4: The influence of semi-supervised training.