Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels

Jiahao Lu, Jiayi Xu, Wenbo Hu, Ruijie Zhu, Chengfeng Zhao, Sai-Kit Yeung, Ying Shan, Yuan Liu

TL;DR

This paper proposes Track4World, a feedforward model that efficiently tracks every pixel in 3D in a world-centric coordinate system and consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking.

Abstract

Estimating the 3D trajectory of every pixel from a monocular video is crucial for a comprehensive understanding of the 3D dynamics of videos. Recent monocular 3D tracking methods demonstrate impressive performance, but they either track only sparse points from the first frame or rely on a slow optimization-based framework for dense tracking. In this paper, we propose Track4World, a feedforward model that enables efficient, holistic 3D tracking of every pixel in a world-centric coordinate system. Built on the global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate dense pixel-wise 2D and 3D flow between arbitrary frame pairs. The estimated scene flow, together with the reconstructed 3D geometry, enables efficient 3D tracking of every pixel in the video. Extensive experiments on multiple benchmarks demonstrate that our approach consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.
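
To make the last step of the abstract concrete, the following is a minimal sketch of how per-pixel scene flow and reconstructed geometry could be combined into world-centric tracks. It is not the authors' implementation: the function names, the array layouts, the camera-to-world pose convention, and the assumption that flow is expressed as a world-space displacement from a single reference frame are all hypothetical choices made for illustration.

```python
import numpy as np


def camera_to_world(points_cam, cam_to_world):
    """Map an (H, W, 3) camera-space point map into world coordinates.

    cam_to_world is assumed to be a 4x4 rigid transform (hypothetical convention).
    """
    rotation = cam_to_world[:3, :3]
    translation = cam_to_world[:3, 3]
    # Row-vector form of R @ p + t, applied to every pixel at once.
    return points_cam @ rotation.T + translation


def track_all_pixels(points_cam_ref, cam_to_world_ref, flows_world):
    """Return (T, H, W, 3) world-space positions of every reference pixel.

    flows_world is assumed to hold, for each of T target frames, the 3D
    displacement of each reference pixel's point in world coordinates.
    """
    points_world_ref = camera_to_world(points_cam_ref, cam_to_world_ref)
    # Broadcasting adds the reference position to each target frame's flow.
    return points_world_ref[None] + flows_world


# Toy usage: a 2x3 image, 4 target frames, camera shifted by +1 along x.
H, W, T = 2, 3, 4
points_cam = np.ones((H, W, 3))
pose = np.eye(4)
pose[:3, 3] = [1.0, 0.0, 0.0]
flows = np.zeros((T, H, W, 3))
flows[1, ..., 2] = 0.5  # every point moves 0.5 along world z by frame 1
tracks = track_all_pixels(points_cam, pose, flows)
print(tracks.shape)
```

Because the tracks live in world coordinates, camera motion is absorbed by the pose and only true scene motion appears in the flow term; a static point keeps the same world position in every frame.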

Paper Structure

This paper contains 46 sections, 35 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Overview. Given (a) the input video frames, Track4World first extracts (b) global scene representations (geometric embeddings, point clouds, and camera poses). (c) A sparse-to-dense scene flow decoder then predicts joint 2D-3D flows between arbitrary timesteps; it applies a novel 2D-to-3D correlation scheme that improves efficiency and allows joint 2D-3D supervision. (d) The pairwise flows are ultimately fused to establish holistic world-centric 3D tracking.
  • Figure 2: Comparison of correlation mechanisms. Prior methods rely on explicit $k$-nearest neighbor searches and cross-attention in 3D space, leading to high computational costs. In contrast, our proposed method anchors 3D updates directly to intermediate image-plane correlations. This design significantly improves computational efficiency and allows the 3D tracking module to be effectively boosted by abundant 2D training data.
  • Figure 3: Qualitative results on diverse in-the-wild videos.
  • Figure S1: Effectiveness of $\ell_{\text{smooth}}^{3d}$.
  • Figure S2: Scene flow visualization. The deformed point maps (colored) show that our method produces more temporally consistent geometry and motion.
  • ...and 5 more figures