Long-Term 3D Point Tracking By Cost Volume Fusion

Hung Nguyen, Chanho Kim, Rigved Naukarkar, Li Fuxin

TL;DR

The paper addresses long-term point tracking in 3D by introducing an online, generalizable framework that operates directly on sequences of 3D point clouds. It fuses multiple past appearances and motion cues using a transformer-based Cost Volume Fusion module, augmented by an adaptive decoding strategy to handle dense scenes and occlusions, predicting per-point motion $v_{t,i}$ and occlusion $\sigma_{t+1,i}$. The approach is trained in two stages (scene-flow pretraining, then long-term tracking) and outperforms 2D long-term trackers and simple scene-flow chaining in 3D tracking on synthetic benchmarks, while operating online without test-time optimization. The work advances 3D scene understanding with potential downstream impact in AR and robotics, and shows how multi-frame cost volumes and motion priors can be integrated via cross-attention for robust long-term tracking.

Abstract

Long-term point tracking is essential for better understanding non-rigid motion in the physical world. Deep learning approaches have recently been incorporated into long-term point tracking, but most prior work operates in 2D. Although these methods benefit from well-established backbones and matching frameworks, the motions they produce do not always make sense in the 3D physical world. In this paper, we propose the first deep learning framework for long-term point tracking in 3D that generalizes to new points and videos without requiring test-time fine-tuning. Our model contains a cost volume fusion module that effectively integrates multiple past appearances and motion information via a transformer architecture, significantly enhancing overall tracking performance. In terms of 3D tracking performance, our model significantly outperforms simple scene flow chaining and previous 2D point tracking methods, even when ground truth depth and camera pose are used to backproject the 2D point tracks in a synthetic scenario.
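
The 2D baselines mentioned above are compared in 3D by backprojecting their 2D tracks with ground truth depth and camera pose. A minimal sketch of that lifting step follows, assuming a standard pinhole camera; the function name `backproject_tracks`, the intrinsics matrix `K`, and the 4x4 `cam_to_world` pose convention are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def backproject_tracks(uv, depth, K, cam_to_world):
    """Lift 2D pixel tracks to 3D world coordinates (hypothetical helper).

    uv:           (N, 2) pixel coordinates of the tracked points
    depth:        (N,)   depth along the camera z axis at each pixel
    K:            (3, 3) pinhole intrinsics (assumed camera model)
    cam_to_world: (4, 4) rigid camera-to-world transform (assumed convention)
    """
    uv = np.asarray(uv, dtype=float)
    depth = np.asarray(depth, dtype=float)
    ones = np.ones((uv.shape[0], 1))
    pixels_h = np.hstack([uv, ones])          # homogeneous pixels, (N, 3)
    rays = pixels_h @ np.linalg.inv(K).T      # camera-frame rays with z = 1
    pts_cam = rays * depth[:, None]           # scale each ray by its depth
    pts_cam_h = np.hstack([pts_cam, ones])    # homogeneous 3D points, (N, 4)
    return (pts_cam_h @ cam_to_world.T)[:, :3]

# Toy example: focal length 100 px, principal point (50, 50),
# camera shifted by +1 along the world x axis.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[0, 3] = 1.0
points_3d = backproject_tracks([[50, 50], [150, 50]], [2.0, 2.0], K, pose)
```

Because each 3D point depends on the depth read at the predicted pixel, a small 2D error that crosses an object boundary can pick up a very different depth value and produce a large 3D error.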

Paper Structure (24 sections, 12 equations, 6 figures, 12 tables)

Figures (6)

  • Figure 1: Long-term Tracking Framework. Given a sequence of point clouds as input, we use a U-Net based backbone to extract the point cloud's features hierarchically. For simplicity, only the decoder branch is shown. At each level, we refine the sparse motion from the previous level of the backbone for the query point by jointly considering multiple past appearances and the past motion of the query. The motion predicted at level $1$ is used as the final motion of the query point from frame $t$ to $t+1$.
  • Figure 2: Cost Volume Fusion Module. We propose a novel Cost Volume Fusion Module to predict the query point motion by jointly considering multiple appearances and the past motion trajectory of the query. These appearances are used to compute a set of cost volumes, which are combined with the motion prior via cross-attention in the transformer layer, followed by an MLP. The output features from the MLP are subsequently used to predict the refined motion and the occlusion status of the query point.
  • Figure 3: Qualitative Results. We lift the results of CoTracker into 3D and project them into a different viewpoint. A small 2D error moves the CoTracker prediction for the red-circled point off the blue object at time $T$; this incurs a significant 3D error, which appears as a sudden jump in the trajectory when rendered from a novel viewpoint. (Best viewed in color.)
  • Figure 4: Running Time Percentage by Components
  • Figure 5: Comparison between CoTracker's and our method's predictions in the new view (i.e., 45°). Upper-left: CoTracker's results. Upper-right: Our results. Lower-left: CoTracker's predictions overlapped with GT trajectories. Lower-right: Our predictions overlapped with GT trajectories. We show more results in our video.
  • ...and 1 more figure
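
The Figure 2 caption describes cost volumes from several past appearances being combined with a motion prior via cross-attention, followed by an MLP that predicts refined motion and occlusion status. The sketch below illustrates that data flow only. It is a single-head NumPy toy with random weights, and it assumes the motion prior acts as the attention query and the cost-volume features as keys and values; all names, shapes, and that query/key assignment are hypothetical and not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_cost_volumes(motion_prior, cost_feats, params):
    """Toy cross-attention fusion for one query point (illustrative only).

    motion_prior: (d,)   embedding of the query's past motion (assumed)
    cost_feats:   (M, d) one cost-volume feature per past appearance (assumed)
    params:       dict of weight matrices (hypothetical names)
    """
    q = motion_prior @ params["Wq"]                 # query, (d,)
    k = cost_feats @ params["Wk"]                   # keys, (M, d)
    v = cost_feats @ params["Wv"]                   # values, (M, d)
    attn = softmax(k @ q / np.sqrt(q.shape[0]))     # weight per appearance, (M,)
    fused = attn @ v + motion_prior                 # residual connection, (d,)
    hidden = np.maximum(fused @ params["W1"], 0.0)  # one-layer ReLU MLP, (h,)
    motion = hidden @ params["W_motion"]            # refined 3D motion, (3,)
    occ_logit = hidden @ params["W_occ"]            # scalar occlusion logit
    occlusion = 1.0 / (1.0 + np.exp(-occ_logit))    # probability in (0, 1)
    return motion, occlusion, attn

# Random weights stand in for learned parameters.
rng = np.random.default_rng(0)
d, h, M = 8, 16, 3
params = {
    "Wq": rng.normal(size=(d, d)), "Wk": rng.normal(size=(d, d)),
    "Wv": rng.normal(size=(d, d)), "W1": rng.normal(size=(d, h)),
    "W_motion": rng.normal(size=(h, 3)), "W_occ": rng.normal(size=(h,)),
}
motion, occlusion, attn = fuse_cost_volumes(
    rng.normal(size=d), rng.normal(size=(M, d)), params
)
```

The attention weights `attn` indicate how much each past appearance contributes to the fused feature, which is one way a model could down-weight appearances in which the point was occluded.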