Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie, Shuyang Sun, Rahul Sukthankar, Joëlle K. Barral, Raia Hadsell, Zoubin Ghahramani, Andrew Zisserman, Junlin Zhang, Mehdi S. M. Sajjadi

TL;DR

This work addresses the challenge of reconstructing and tracking dynamic 4D scenes directly from video. It introduces D4RT, a unified feedforward model with a global scene encoder and a lightweight, query-based decoder that predicts 3D point positions for arbitrary space-time queries, enabling outputs such as depth maps, dense point clouds, and camera parameters. Dense, efficient reconstruction is achieved via independent queries and an occupancy-grid strategy for tracking all pixels, so that cost scales linearly with the number of queried points. Empirically, D4RT sets a new state of the art across 4D reconstruction and tracking tasks, delivers substantial speedups over prior methods, and remains robust on both static and dynamic scenes.
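Because each query is decoded independently, a set of queries can in principle be split into chunks of any size without changing the result, and the work grows linearly with the number of queries. The following is a minimal sketch of that property only. `toy_decode` is a hypothetical stand-in for a per-query decoder and is not the D4RT model; the tuple layout `(u, v, t_src, t_tgt, t_cam)` follows the query described in the Figure 2 caption, and `decode_in_chunks` is an illustrative helper name.

```python
from typing import Callable, List, Sequence, Tuple

# Query layout taken from the Figure 2 caption: (u, v, t_src, t_tgt, t_cam).
Query = Tuple[float, float, int, int, int]
Point3D = Tuple[float, float, float]


def toy_decode(q: Query) -> Point3D:
    """Hypothetical stand-in for a per-query decoder (NOT the D4RT model).

    It depends only on its own query, which is the property being illustrated.
    """
    u, v, t_src, t_tgt, t_cam = q
    return (u + t_tgt, v - t_cam, 1.0 + t_src)


def decode_in_chunks(
    queries: Sequence[Query],
    decode: Callable[[Query], Point3D],
    chunk_size: int,
) -> List[Point3D]:
    """Decode queries chunk by chunk; one decode call per query."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    points: List[Point3D] = []
    for start in range(0, len(queries), chunk_size):
        chunk = queries[start:start + chunk_size]
        points.extend(decode(q) for q in chunk)
    return points


# Example: 20 queries decoded with two different chunk sizes.
queries = [(float(i % 5), float(i // 5), 0, i % 3, 0) for i in range(20)]
small_chunks = decode_in_chunks(queries, toy_decode, chunk_size=1)
large_chunks = decode_in_chunks(queries, toy_decode, chunk_size=7)
```

Since no query reads another query's output, the chunk size is purely a memory and throughput knob in this sketch; it does not affect the decoded points.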

Abstract

Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.

Paper Structure

This paper contains 23 sections, 5 equations, 13 figures, 11 tables, 1 algorithm.

Figures (13)

  • Figure 1: D4RT is a unified, efficient, feedforward method for Dynamic 4D Reconstruction and Tracking, unlocking a variety of outputs, including point clouds, point tracks, and camera parameters, through a single interface.
  • Figure 2: D4RT model overview -- A global self-attention encoder first transforms the input video into the latent Global Scene Representation $F$, which is passed to a lightweight decoder. The decoder can be independently queried for the 3D position $\mathbf{P}$ of any given 2D point ($u$, $v$) from the source timestep $t_\text{src}$ at the target timestep $t_\text{tgt}$, expressed in the camera coordinates of timestep $t_\text{cam}$, unlocking full decoding at any point in space and time. The query also contains an embedding of the local video patch centered around ($u$, $v$), providing additional spatial context.
  • Figure 3: Pose accuracy vs. speed -- We compare pose accuracy vs. throughput against recent state-of-the-art methods. Pose accuracy is 1 -- error, averaged over ATE/RTE/RPE on Sintel and ScanNet. Throughput is measured in FPS on an A100 GPU. D4RT achieves 200+ FPS pose estimation, 9$\times$ faster than VGGT, and 100$\times$ faster than MegaSaM, while delivering superior accuracy.
  • Figure 4: Reconstruction results across methods -- Pure reconstruction methods (MegaSaM and $\pi^3$) can only accumulate point clouds of all pixels and exhibit clear failure cases in dynamic scenes. For example, the swan is repeated in MegaSaM's reconstruction, and $\pi^3$ fails entirely to reconstruct the flower. SpatialTrackerV2, a state-of-the-art tracking method, successfully captures dynamics; however, its design only allows tracking points from one frame, leaving gaps in the reconstruction (behind the swan and train). D4RT is the only method that successfully reconstructs a full 4D representation of the scene, including all pixels of the video.
  • Figure 5: Visualizations on in-the-wild videos -- D4RT demonstrates accurate reconstructions on static (top row) and dynamic scenes (bottom row). In the presence of motion, D4RT additionally produces robust 3D point trajectories.
  • ...and 8 more figures
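
The query described in the Figure 2 caption can be pictured as a small record with five fields. The sketch below shows one way such records might be arranged to request a per-frame depth map or a point trajectory. The arrangements are assumptions made for illustration: this summary does not specify how D4RT composes queries for each output, whether `u` and `v` are pixel or normalized coordinates, or how timesteps are indexed. The names `Query`, `depth_map_queries`, and `track_queries` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Query:
    """One decoder query, with fields named after the Figure 2 caption."""
    u: float    # horizontal image coordinate (pixel units assumed here)
    v: float    # vertical image coordinate (pixel units assumed here)
    t_src: int  # timestep of the frame in which (u, v) is specified
    t_tgt: int  # timestep at which the point's 3D position is requested
    t_cam: int  # timestep whose camera coordinates the answer is given in


def depth_map_queries(width: int, height: int, t: int) -> List[Query]:
    """Assumed arrangement: one query per pixel of frame t, with source,
    target, and camera timesteps all equal to t (row-major order)."""
    return [Query(float(u), float(v), t, t, t)
            for v in range(height) for u in range(width)]


def track_queries(u: float, v: float, t_src: int,
                  num_frames: int, t_cam: int) -> List[Query]:
    """Assumed arrangement: follow one source point across all target
    timesteps, keeping the camera timestep fixed so that the trajectory
    is expressed in a single coordinate frame."""
    return [Query(u, v, t_src, t_tgt, t_cam) for t_tgt in range(num_frames)]


depth_q = depth_map_queries(width=4, height=3, t=2)
track_q = track_queries(u=1.5, v=0.5, t_src=0, num_frames=5, t_cam=0)
```

In this sketch the number of queries is simply the number of requested points (`width * height` for a depth map, `num_frames` for one track), which is where the linear scaling mentioned in the TL;DR would show up.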