Long-Term 3D Point Tracking By Cost Volume Fusion
Hung Nguyen, Chanho Kim, Rigved Naukarkar, Li Fuxin
TL;DR
The paper addresses long-term point tracking in 3D by introducing an online, generalizable framework that operates directly on sequences of 3D point clouds. A transformer-based Cost Volume Fusion module fuses multiple past appearances and motion cues, and an adaptive decoding strategy handles dense scenes and occlusions. For each tracked point, the model predicts the motion $v_{t,i}$ and the occlusion status $\sigma_{t+1,i}$. Training proceeds in two stages (scene-flow pretraining, then long-term tracking). On synthetic benchmarks the approach outperforms 2D long-term trackers and simple scene-flow chaining, while running online without test-time optimization. The work advances 3D scene understanding with potential downstream impact in AR and robotics, and shows how multi-frame cost volumes and motion priors can be integrated via cross-attention for robust long-term tracking.
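For context, the scene-flow chaining baseline mentioned above can be illustrated with a minimal sketch. It assumes that each frame comes with a per-point flow estimate and that the tracked point is re-associated with its nearest neighbor at every step; the function names and the nearest-neighbor lookup are illustrative assumptions, not the exact baseline evaluated in the paper.

```python
import math

def nearest_index(point, cloud):
    """Return the index of the point in `cloud` closest to `point`."""
    return min(range(len(cloud)), key=lambda j: math.dist(point, cloud[j]))

def chain_scene_flow(query, clouds, flows):
    """Track `query` across frames by chaining per-frame scene flow.

    clouds[t] is a list of (x, y, z) points at frame t, and flows[t][j] is
    the estimated displacement of clouds[t][j] from frame t to frame t+1.
    Both inputs are hypothetical stand-ins for a scene-flow model's output.
    """
    position = tuple(query)
    track = [position]
    for cloud, flow in zip(clouds, flows):
        # Borrow the flow of the nearest observed point (an assumption).
        j = nearest_index(position, cloud)
        position = tuple(p + d for p, d in zip(position, flow[j]))
        track.append(position)
    return track

# Toy example: two points, one drifting along x, one along y.
clouds = [[(0, 0, 0), (1, 0, 0)], [(0.1, 0, 0), (1, 0.5, 0)]]
flows = [[(0.1, 0, 0), (0, 0.5, 0)], [(0.1, 0, 0), (0, 0.5, 0)]]
track = chain_scene_flow((1, 0, 0), clouds, flows)
```

Because each step depends only on the previous position, a single bad flow estimate or an occluded point moves the track onto the wrong neighbor and the error is never corrected, which is the weakness that fusing multiple past appearances is meant to address.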
Abstract
Long-term point tracking is essential for better understanding non-rigid motion in the physical world. Deep learning approaches have recently been incorporated into long-term point tracking, but most prior work operates in 2D. Although these methods benefit from well-established backbones and matching frameworks, the motions they produce do not always make sense in the 3D physical world. In this paper, we propose the first deep learning framework for long-term point tracking in 3D that generalizes to new points and videos without requiring test-time fine-tuning. Our model contains a cost volume fusion module that effectively integrates multiple past appearances and motion information via a transformer architecture, significantly enhancing overall tracking performance. In terms of 3D tracking performance, our model significantly outperforms simple scene flow chaining and previous 2D point tracking methods, even when ground-truth depth and camera pose are used to backproject the 2D point tracks in a synthetic scenario.
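The 2D comparison above relies on lifting 2D tracks into 3D with known depth and camera pose. A minimal sketch of that backprojection follows, assuming a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and a camera-to-world pose given as a rotation matrix and translation vector. These parameter names and conventions are assumptions for illustration; the abstract does not specify them.

```python
def backproject(u, v, depth, fx, fy, cx, cy, rotation, translation):
    """Lift pixel (u, v) with metric depth to a world-space 3D point.

    Assumes a pinhole camera model and a camera-to-world pose, so that
    X_world = R @ X_camera + t. `rotation` is a 3x3 nested sequence and
    `translation` a length-3 sequence.
    """
    # Camera-space coordinates from the pinhole model.
    x_c = (u - cx) / fx * depth
    y_c = (v - cy) / fy * depth
    cam = (x_c, y_c, depth)
    # Apply the rigid camera-to-world transform row by row.
    return tuple(
        sum(r * c for r, c in zip(row, cam)) + t
        for row, t in zip(rotation, translation)
    )

# Toy example: identity rotation, camera shifted 1 unit along z.
identity = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
point = backproject(150, 50, 2.0, fx=100, fy=100, cx=50, cy=50,
                    rotation=identity, translation=(0, 0, 1))
```

Applying this to every frame of a 2D track yields a 3D trajectory, but only while the point is visible, since an occluded pixel returns the depth of the occluding surface.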
