Flow4R: Unifying 4D Reconstruction and Tracking with Scene Flow

Shenhan Qian, Ganlin Zhang, Shangzhe Wu, Daniel Cremers

TL;DR

Flow4R is a unified framework that treats camera-space scene flow as the central representation linking 3D structure, object motion, and camera motion. Using a Vision Transformer, it predicts a minimal per-pixel property set (3D point position, scene flow, pose weight, and confidence) from two-view inputs.

Abstract

Reconstructing and tracking dynamic 3D scenes remains a fundamental challenge in computer vision. Existing approaches often decouple geometry from motion: multi-view reconstruction methods assume static scenes, while dynamic tracking frameworks rely on explicit camera pose estimation or separate motion models. We propose Flow4R, a unified framework that treats camera-space scene flow as the central representation linking 3D structure, object motion, and camera motion. Flow4R predicts a minimal per-pixel property set (3D point position, scene flow, pose weight, and confidence) from two-view inputs using a Vision Transformer. This flow-centric formulation allows local geometry and bidirectional motion to be inferred symmetrically with a shared decoder in a single forward pass, without requiring explicit pose regressors or bundle adjustment. Trained jointly on static and dynamic datasets, Flow4R achieves state-of-the-art performance on 4D reconstruction and tracking tasks, demonstrating the effectiveness of the flow-central representation for spatiotemporal scene understanding.
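
To make the per-pixel property set concrete, the following is a minimal sketch of how one view's predictions might be held in memory. The `PropertySet` class, its field names, the array shapes, and the `make_dummy` helper are all hypothetical and not taken from the paper. The sketch also assumes that adding the scene flow to the point position gives each point's location as seen from the paired view, which is one reading of "camera-space scene flow" rather than a confirmed detail.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class PropertySet:
    """Hypothetical container for one view's per-pixel predictions."""

    points: np.ndarray       # (H, W, 3) point position P in the view's local space
    flow: np.ndarray         # (H, W, 3) scene flow F toward the paired view
    pose_weight: np.ndarray  # (H, W) pose weight W
    confidence: np.ndarray   # (H, W) confidence C

    def moved_points(self) -> np.ndarray:
        """Assumed semantics: P + F is each point as seen from the paired view."""
        return self.points + self.flow


def make_dummy(height: int, width: int, seed: int = 0) -> PropertySet:
    """Build random placeholder maps; a real model would predict these."""
    rng = np.random.default_rng(seed)
    return PropertySet(
        points=rng.normal(size=(height, width, 3)),
        flow=rng.normal(scale=0.1, size=(height, width, 3)),
        pose_weight=rng.uniform(size=(height, width)),
        confidence=rng.uniform(size=(height, width)),
    )


if __name__ == "__main__":
    view = make_dummy(4, 5)
    print(view.moved_points().shape)
```

Because the decoder is described as shared and the motion as bidirectional, each image of a pair would receive its own such set, one per direction.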

Paper Structure

This paper contains 35 sections, 16 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Flow4R takes two images as input at a time and predicts a pixel-aligned property set, including point position $\mathbf{P}$, scene flow $\mathbf{F}$, pose weight $\mathbf{W}$, and confidence $\mathbf{C}$ (omitted in this figure), from which various downstream predictions can be deduced.
  • Figure 2: Sequence processing paradigms.
  • Figure 3: Qualitative Results. We visualize the 3D tracking trajectories by projecting them onto 2D. Ground-truth trajectories are marked with dots ($\bullet$), and predicted trajectories with plus symbols ($+$). Flow4R shows lower reprojection error on both the background and the foreground.
  • Figure 4: 3D Visualization of the Predictions. Examples are taken from the DAVIS (Perazzi et al., 2016), Aria Digital Twin (Pan et al., 2023), and Point Odyssey (Zheng et al., 2023) datasets. Our model is capable of reconstructing 3D scenes and tracking the motion of both the camera and objects.
  • Figure 5: 2D Visualizations of Flow4R Predictions. Given an image pair, Flow4R predicts for each image the point position $\mathbf{P}$, scene flow $\mathbf{F}$, pose weight $\mathbf{W}$, and confidence $\mathbf{C}$. The point position map $\mathbf{P}$ captures scene geometry in the local space. The scene flow map $\mathbf{F}$ describes how each point moves from the current image to its pair, capturing both camera and object motions. The pose weight map $\mathbf{W}$ indicates which pixels are reliable for camera pose estimation. The confidence map $\mathbf{C}$ indicates the uncertainty of the predictions.
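
The captions above describe $\mathbf{W}$ as marking pixels that are reliable for camera pose estimation, but they do not say how a pose is obtained from the maps. One standard option, shown here purely as an illustration and not as the paper's confirmed procedure, is a weighted Kabsch (Procrustes) fit between the points $\mathbf{P}$ and their flowed positions $\mathbf{P} + \mathbf{F}$, with $\mathbf{W}$ as per-point weights. The function name `weighted_rigid_fit` and the synthetic data are assumptions made for this sketch.

```python
import numpy as np


def weighted_rigid_fit(src, dst, weights):
    """Weighted Kabsch: R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of corresponding 3D points.
    weights:  (N,) non-negative weights with a positive sum.
    """
    w = weights / weights.sum()
    src_mean = (w[:, None] * src).sum(axis=0)
    dst_mean = (w[:, None] * dst).sum(axis=0)
    src_c = src - src_mean
    dst_c = dst - dst_mean
    # Weighted 3x3 cross-covariance: sum_i w_i * src_c_i * dst_c_i^T
    cov = (w[:, None] * src_c).T @ dst_c
    u, _, vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    trans = dst_mean - rot @ src_mean
    return rot, trans


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.normal(size=(100, 3))  # stands in for P, flattened to (N, 3)
    angle = 0.2
    rot_true = np.array([
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle), np.cos(angle), 0.0],
        [0.0, 0.0, 1.0],
    ])
    trans_true = np.array([0.1, -0.2, 0.3])
    flow = points @ rot_true.T + trans_true - points  # rigid, camera-induced flow
    flow[:10] += 0.5                 # first 10 points also move on their own
    pose_weight = np.ones(100)
    pose_weight[:10] = 0.0           # down-weight the moving points
    rot, trans = weighted_rigid_fit(points, points + flow, pose_weight)
    residual = points + flow - (points @ rot.T + trans)  # leftover object motion
    print(np.abs(residual[10:]).max(), np.abs(residual[:10]).mean())
```

Under these assumptions, the flow left over after removing the fitted rigid motion is close to zero for static points and isolates the independent motion of dynamic ones. This matches the caption's statement that $\mathbf{F}$ captures both camera and object motion.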