SceneTracker: Long-term Scene Flow Estimation Network

Bo Wang, Jian Li, Yang Yu, Li Liu, Zhenping Sun, Dewen Hu

TL;DR

This paper defines long-term scene flow estimation (LSFE) and introduces SceneTracker, the first LSFE network that iteratively estimates a target's 3D trajectory across a sequence by fusing appearance and depth residual features and leveraging Transformer-based long-range connections. It extends CoTracker with 3D trajectory regression, a depth-aware flow iteration scheme, and an unrolled training objective, enabling robust online tracking under occlusion and depth noise. The authors also construct LSFOdyssey, a synthetic LSFE dataset, and LSFDriving, a real-world dataset, to validate generalization from synthetic to real data; SceneTracker consistently outperforms SFE-based baselines and TAP baselines on 2D/3D metrics and shows strong real-world transfer. Overall, the work demonstrates the viability and value of long-term 3D motion tracking for robotics, autonomous driving, and AR/VR, and provides datasets and code to advance LSFE research.

Abstract

Scene flow estimation captures fine-grained motion in the spatial domain but lacks coherence in the temporal domain. To address this, this study proposes long-term scene flow estimation (LSFE), a comprehensive task that simultaneously captures fine-grained and long-term 3D motion in an online manner. We introduce SceneTracker, the first LSFE network, which adopts an iterative approach to approximate the optimal 3D trajectory. The network dynamically indexes and constructs appearance correlation and depth residual features simultaneously. Transformers are then employed to explore and utilize long-range connections within and between trajectories. Detailed experiments show that SceneTracker has superior capabilities in addressing 3D spatial occlusion and depth noise interference, making it highly tailored to the needs of the LSFE task. We also build a real-world evaluation dataset, LSFDriving, for the LSFE field and use it in experiments to further demonstrate SceneTracker's generalization ability. The code and data are available at https://github.com/wwsource/SceneTracker.

Paper Structure (27 sections, 6 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Architecture of the proposed method. (a) LSFE process of a current sliding window. Using $S$ RGB-D frames and the initialized trajectories $P^{xyz}_{init}$ as inputs, the network estimates the 3D trajectory segments $P^{xyz}$. (b) Flow iteration module. Template feature $Q_n$ and 3D downscaled trajectories $P^{uvd}_n$ are iteratively updated. (c) Transformer Updater network. The input feature is enhanced by Transformer blocks that factorize the attention across time and space.
  • Figure 2: Visualizations of SceneTracker estimation results on the LSFOdyssey test dataset.
  • Figure 3: Examples of the proposed LSFDriving dataset.
  • Figure 4: Qualitative results of the TAP baseline, the SF baseline, and our SceneTracker on the LSFOdyssey test dataset. We visualize the trajectory estimates and ground truth of the final frame's point cloud. The trajectories are colorized using a jet colormap. The solid-box marked regions represent areas where the SF baseline exhibits significant errors due to occlusion or exceeding boundaries.
  • Figure 5: Metric $\text{EPE}_{3D}$ on the LSFOdyssey test dataset versus the number of FIM iterations after the Odyssey training process.
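
The captions above refer to trajectories in image-plus-depth coordinates ($P^{uvd}$), trajectories in 3D coordinates ($P^{xyz}$), and the $\text{EPE}_{3D}$ metric. The sketch below illustrates how these quantities typically relate. It assumes a standard pinhole camera model for the conversion and the usual definition of 3D end-point error as the mean Euclidean distance between predicted and ground-truth points. The function names, the intrinsics, and the sample values are hypothetical, and the paper's exact evaluation protocol (for example, any masking of invalid points) may differ.

```python
import math

def backproject_uvd(track_uvd, fx, fy, cx, cy):
    """Pinhole back-projection of (u, v, d) points to camera-frame (x, y, z).

    Assumes d is metric depth along the optical axis and (fx, fy, cx, cy)
    are the camera intrinsics in pixels.
    """
    return [((u - cx) * d / fx, (v - cy) * d / fy, d) for u, v, d in track_uvd]

def epe_3d(pred_xyz, gt_xyz):
    """Mean Euclidean distance between matched predicted and ground-truth 3D points."""
    if len(pred_xyz) != len(gt_xyz) or not pred_xyz:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, g) for p, g in zip(pred_xyz, gt_xyz)) / len(pred_xyz)

# Hypothetical intrinsics and a two-frame trajectory, for illustration only.
fx = fy = 100.0
cx = cy = 50.0
pred = backproject_uvd([(50.0, 50.0, 2.0), (150.0, 50.0, 4.0)], fx, fy, cx, cy)
gt = [(0.0, 0.0, 2.0), (4.0, 3.0, 4.0)]
error = epe_3d(pred, gt)  # first point matches exactly, second is off by 3.0
print(error)  # 1.5
```

In this toy case the per-point errors are 0 and 3, so the mean is 1.5. A lower value indicates that the estimated 3D trajectory stays closer to the ground truth.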