DELTAv2: Accelerating Dense 3D Tracking

Tuan Duc Ngo, Ashkan Mirzaei, Guocheng Qian, Hanwen Liang, Chuang Gan, Evangelos Kalogerakis, Peter Wonka, Chaoyang Wang

TL;DR

DELTAv2 accelerates dense long-range 3D tracking by addressing two bottlenecks in prior methods: heavy transformer computation across many trajectories and the cost of 4D correlation features. It introduces a coarse-to-fine tracking scheme that starts from a subsampled set of trajectories and progressively densifies it, paired with a learnable interpolation module that propagates motion to untracked pixels. It also optimizes the 4D correlation computation with a lightweight projection that improves GPU utilization. Together, these changes yield a roughly 5x speedup over DELTA while maintaining state-of-the-art accuracy, making real-time or large-scale dense 3D tracking on RGB-D videos more practical.

Abstract

We propose a novel algorithm for accelerating dense long-term 3D point tracking in videos. Through analysis of existing state-of-the-art methods, we identify two major computational bottlenecks. First, transformer-based iterative tracking becomes expensive when handling a large number of trajectories. To address this, we introduce a coarse-to-fine strategy that begins tracking with a small subset of points and progressively expands the set of tracked trajectories. The newly added trajectories are initialized using a learnable interpolation module, which is trained end-to-end alongside the tracking network. Second, we propose an optimization that significantly reduces the cost of correlation feature computation, another key bottleneck in prior methods. Together, these improvements lead to a 5-100x speedup over existing approaches while maintaining state-of-the-art tracking accuracy.
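
To make the coarse-to-fine idea concrete, the sketch below counts how many trajectories are refined per iteration under a hypothetical stride schedule and densifies a coarse motion grid between stages. The function names, the 64x64 resolution, and the schedule `[8, 4, 2, 1]` are assumptions chosen for illustration, not values from the paper, and nearest-neighbour copying stands in for the paper's learnable interpolation module.

```python
def coarse_to_fine_schedule(height, width, strides):
    """Number of tracked points per iteration for a given stride schedule.

    `strides` is a hypothetical schedule: iteration i tracks one point per
    strides[i] x strides[i] pixel block.
    """
    counts = []
    for s in strides:
        rows = (height + s - 1) // s  # ceiling division
        cols = (width + s - 1) // s
        counts.append(rows * cols)
    return counts


def upsample_nearest(grid, factor):
    """Densify a coarse 2D grid of motion values by nearest-neighbour copy.

    Stand-in only: the paper initializes new trajectories with a learnable
    interpolation module instead of plain copying.
    """
    dense = []
    for row in grid:
        dense_row = [v for v in row for _ in range(factor)]
        for _ in range(factor):
            dense.append(list(dense_row))
    return dense


H, W = 64, 64                      # assumed frame size
strides = [8, 4, 2, 1]             # assumed coarse-to-fine schedule
coarse = coarse_to_fine_schedule(H, W, strides)
dense = coarse_to_fine_schedule(H, W, [1] * len(strides))

print(coarse)                      # [64, 256, 1024, 4096]
print(sum(coarse), sum(dense))     # 5440 16384
```

The total number of point refinements is only a crude proxy for runtime, since transformer cost does not scale linearly with the number of trajectories, but it shows why early iterations become much cheaper when they operate on a subsampled grid.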

Paper Structure

This paper contains 20 sections, 3 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: DELTAv2 achieves state-of-the-art for dense 3D tracking — matching the accuracy of DELTA while being $5\times$ faster, and outperforming all prior methods in speed–accuracy tradeoff. (Left) Long-range 3D trajectories on real-world videos. (Right) Performance vs. FPS comparison.
  • Figure 2: Runtime breakdown of DELTA (Ngo et al., 2024) with a single iteration and one sliding window.
  • Figure 3: A modern long-range tracking pipeline (we omit the 3D tracking parts here for simplicity).
  • Figure 4: Overview of our proposed framework. (Left) Traditional iterative dense tracking refines all trajectories at every iteration, leading to high computational cost. (Middle) Our coarse-to-fine iterative dense tracking reduces computation by subsampling trajectory points in early iterations and progressively increasing the density across iterations. (Right) A learnable interpolation module leverages attention to infer untracked motions from nearby tracked pixels, enabling efficient and adaptive trajectory propagation.
  • Figure 5: Analysis of the coarse-to-fine strategy. We visualize how accuracy evolves with runtime across different methods. (a) We evaluate runtime per iteration and observe that the coarse-to-fine strategy consistently reduces runtime compared to the baseline while achieving similar accuracy. (b) Nearest neighbor interpolation outperforms bilinear, and our learnable interpolation further improves accuracy. (c) We compare different coarse-to-fine scheduling strategies.
  • ...and 7 more figures
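
The Figure 4 caption describes interpolation that uses attention to infer an untracked pixel's motion from nearby tracked pixels. A minimal sketch of that idea follows, assuming plain scaled dot-product attention over hand-supplied feature vectors; the function name and inputs are hypothetical, and the paper's actual module (learned projections, neighbourhood selection, training) is not reproduced here.

```python
import math


def attention_interpolate(query_feat, neighbor_feats, neighbor_motions):
    """Estimate an untracked pixel's motion from nearby tracked pixels.

    Hypothetical helper: weights come from scaled dot-product attention
    between the query feature and each neighbour feature. In a learned
    module these features would pass through trained projections.
    """
    d = len(query_feat)
    scores = [
        sum(q * k for q, k in zip(query_feat, feat)) / math.sqrt(d)
        for feat in neighbor_feats
    ]
    # Numerically stable softmax over the neighbours.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted average of the neighbours' motion vectors.
    dim = len(neighbor_motions[0])
    motion = [
        sum(w * m[i] for w, m in zip(weights, neighbor_motions))
        for i in range(dim)
    ]
    return motion, weights


# Two neighbours with identical features receive equal weight, so the
# result is the plain average of their 3D motions.
motion, weights = attention_interpolate(
    [1.0, 0.0],
    [[1.0, 0.0], [1.0, 0.0]],
    [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
)
print(motion, weights)
```

Unlike nearest-neighbour or bilinear interpolation, the weights here depend on feature similarity rather than only on pixel distance, which is what lets such a module adapt to motion boundaries.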