DELTA: Dense Efficient Long-range 3D Tracking for any video

Tuan Duc Ngo, Peiye Zhuang, Chuang Gan, Evangelos Kalogerakis, Sergey Tulyakov, Hsin-Ying Lee, Chaoyang Wang

TL;DR

DELTA addresses dense, long-range 3D tracking from monocular video by following every pixel in 3D with a coarse-to-fine pipeline: a joint global-local spatial attention mechanism operates at reduced resolution, and an attention-based upsampler then recovers full-resolution trajectories. A key design choice is the log-depth representation, which improves robustness and accuracy in 3D tracking. The method achieves state-of-the-art performance on dense 2D and 3D benchmarks while running over 8x faster than prior dense tracking approaches, as demonstrated on the Kubric and CVO datasets and validated with real-world depth inputs. Overall, DELTA provides a scalable, end-to-end framework for fine-grained, long-term motion tracking in 3D space that generalizes across datasets and depth sources.

Abstract

Tracking dense 3D motion from monocular videos remains challenging, particularly when aiming for pixel-level precision over long sequences. We introduce DELTA, a novel method that efficiently tracks every pixel in 3D space, enabling accurate motion estimation across entire videos. Our approach leverages a joint global-local attention mechanism for reduced-resolution tracking, followed by a transformer-based upsampler to achieve high-resolution predictions. Unlike existing methods, which are limited by computational inefficiency or sparse tracking, DELTA delivers dense 3D tracking at scale, running over 8x faster than previous methods while achieving state-of-the-art accuracy. Furthermore, we explore the impact of depth representation on tracking performance and identify log-depth as the optimal choice. Extensive experiments demonstrate the superiority of DELTA on multiple benchmarks, achieving new state-of-the-art results in both 2D and 3D dense tracking tasks. Our method provides a robust solution for applications requiring fine-grained, long-term motion tracking in 3D space.
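
The abstract names log-depth as the preferred depth representation but does not spell out the formulation. The sketch below is only an illustration of one general property of log-depth, not the paper's implementation: an equal relative change in depth maps to an equal additive change in log space, whether the point is near or far. The helper names `to_log_depth` and `from_log_depth`, the `eps` guard, and the sample depths are all assumptions made for this example.

```python
import numpy as np

def to_log_depth(depth, eps=1e-6):
    """Map positive metric depth to log-depth; eps (assumed) guards against log(0)."""
    return np.log(np.maximum(depth, eps))

def from_log_depth(log_depth):
    """Invert the mapping back to metric depth."""
    return np.exp(log_depth)

# Hypothetical depths: the same 10% relative change at a near and a far point.
near = np.array([1.0, 1.1])
far = np.array([50.0, 55.0])

# In linear depth the two changes differ by a factor of 50.
linear_change_near = near[1] - near[0]
linear_change_far = far[1] - far[0]

# In log-depth both changes equal log(1.1).
log_near = to_log_depth(near)
log_far = to_log_depth(far)
log_change_near = log_near[1] - log_near[0]
log_change_far = log_far[1] - log_far[0]

print(linear_change_near, linear_change_far)
print(log_change_near, log_change_far)
```

In this toy setting, a regression target expressed in log-depth weights near and far points by relative rather than absolute error, which is one plausible reading of why the representation helps.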

Paper Structure

This paper contains 20 sections, 4 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: DELTA is a dense 3D tracking approach that (a) tracks every pixel from a monocular video, (b) provides consistent trajectories in 3D space, and (c) achieves state-of-the-art accuracy on 3D tracking benchmarks while being significantly faster than previous methods in the dense setting.
  • Figure 2: Overview of DELTA. DELTA takes RGB-D videos as input and achieves efficient dense 3D tracking using a coarse-to-fine strategy, beginning with coarse tracking through a spatio-temporal attention mechanism at reduced resolution (Sec. \ref{sec:preliminary}, \ref{sec:spatial_attention}), followed by an attention-based upsampler for high-resolution predictions (Sec. \ref{sec:upsample}).
  • Figure 3: Spatial attention architectures. Top: Illustration of different spatial attention architectures. Compared to prior methods, our proposed architecture ③ incorporates both global and local spatial attention and can be efficiently learned using a patch-by-patch strategy. Bottom: Long-term optical flows predicted with different spatial attention designs. We find that both global and local attention are crucial for improving tracking accuracy, as highlighted by the red circles. Additionally, our computationally efficient global attention design using anchor tracks (i.e., ③ W/o Local Attn) achieves similar accuracy to the more computationally intensive CoTracker version ②.
  • Figure 4: Attention-based upsample module. Left: We apply multiple blocks of local cross-attention to learn the upsampling weights for each pixel in the fine resolution. Right: The red circles highlight regions in the long-term flow maps where our attention-based upsampler produces more accurate predictions compared to RAFT's convolution-based upsampler.
  • Figure 5: Qualitative comparison of dense 3D tracking on in-the-wild videos among CoTracker $+$ UniDepth, SceneTracker, SpatialTracker, and our method. We densely track every pixel from the first frame of the video in 3D space; moving objects are highlighted in rainbow colors. Our method accurately tracks the motion of foreground objects while maintaining stable backgrounds.
  • ...and 4 more figures
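
The Figure 4 caption describes an upsampler that learns per-pixel upsampling weights at the fine resolution. As a rough illustration of how such weights could be applied, the sketch below forms each fine-resolution value as a softmax-weighted combination of the 3x3 coarse neighborhood around its parent cell. This is a minimal sketch under stated assumptions, not the paper's module: the function name `weighted_upsample`, the 3x3 neighborhood, the upsampling factor, and the random logits are all hypothetical, and the local cross-attention blocks that would produce the logits are not reproduced.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def weighted_upsample(coarse, logits, factor):
    """Upsample a coarse (H, W, C) field to (H*factor, W*factor, C).

    logits has shape (H, W, factor, factor, 9): for each fine pixel, one score
    per coarse cell in the 3x3 neighborhood of its parent coarse cell.
    """
    H, W, C = coarse.shape
    weights = softmax(logits, axis=-1)  # weights over the 9 neighbors sum to 1
    padded = np.pad(coarse, ((1, 1), (1, 1), (0, 0)), mode="edge")
    # Gather the 3x3 neighborhood of every coarse cell: shape (H, W, 9, C).
    neighbors = np.stack(
        [padded[dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)],
        axis=2,
    )
    # Weighted sum over the 9 neighbors for each fine pixel: (H, W, f, f, C).
    fine = np.einsum("hwijk,hwkc->hwijc", weights, neighbors)
    # Interleave sub-pixel offsets with coarse indices.
    return fine.transpose(0, 2, 1, 3, 4).reshape(H * factor, W * factor, C)

# Random stand-ins for a coarse 2-channel flow field and for predicted scores.
rng = np.random.default_rng(0)
coarse = rng.normal(size=(4, 5, 2))
logits = rng.normal(size=(4, 5, 4, 4, 9))
fine = weighted_upsample(coarse, logits, factor=4)
print(fine.shape)
```

Because the weights are normalized, every output is a convex combination of nearby coarse values; if all the weight falls on the center cell, the result reduces to nearest-neighbor upsampling.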