Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction

Edgar Sucar, Zihang Lai, Eldar Insafutdinov, Andrea Vedaldi

TL;DR

Dynamic Point Maps (DPM) extend DUSt3R by introducing time as an additional reference dimension, yielding invariance to both viewpoint and scene motion. The method predicts, for each image, two time-stamped point maps (one per timestamp) expressed in the first camera's frame, enabling immediate 4D reductions such as scene flow, motion segmentation, and object tracking within a single network. Trained on a mix of synthetic and real data across seven datasets, DPM demonstrates state-of-the-art or competitive performance in depth prediction, dynamic reconstruction, and scene/object flow, while maintaining a compact, end-to-end architecture. This work lays the groundwork for dynamic 3D foundation models by providing a unified, scalable representation that handles both spatial and temporal variations and simplifies downstream 4D reasoning.

Abstract

DUSt3R has recently shown that one can reduce many tasks in multi-view geometry, including estimating camera intrinsics and extrinsics, reconstructing the scene in 3D, and establishing image correspondences, to the prediction of a pair of viewpoint-invariant point maps, i.e., pixel-aligned point clouds defined in a common reference frame. This formulation is elegant and powerful, but unable to tackle dynamic scenes. To address this challenge, we introduce the concept of Dynamic Point Maps (DPM), extending standard point maps to support 4D tasks such as motion segmentation, scene flow estimation, 3D object tracking, and 2D correspondence. Our key intuition is that, when time is introduced, there are several possible spatial and time references that can be used to define the point maps. We identify a minimal subset of such combinations that can be regressed by a network to solve the subtasks mentioned above. We train a DPM predictor on a mixture of synthetic and real data and evaluate it across diverse benchmarks for video depth prediction, dynamic point cloud reconstruction, 3D scene flow, and object pose tracking, achieving state-of-the-art performance. Code, models, and additional results are available at https://www.robots.ox.ac.uk/~vgg/research/dynamic-point-maps/.
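
To make the reduction to 4D tasks concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the network has already produced two point maps for the pixels of the first image, one at each timestamp, both expressed in the first camera's frame. The function name `scene_flow_and_motion_mask`, the array layout `(H, W, 3)`, and the threshold value are illustrative assumptions. Under these assumptions, scene flow is the per-pixel difference of the two maps, and a motion mask follows from thresholding its magnitude.

```python
import numpy as np

def scene_flow_and_motion_mask(p1_t1, p1_t2, threshold=0.05):
    """Derive scene flow and a motion mask from two point maps.

    p1_t1, p1_t2: arrays of shape (H, W, 3) holding the 3D points seen by
    the pixels of image 1 at times t1 and t2, both in camera 1's frame.
    threshold: hypothetical cut-off on flow magnitude (scene units).
    """
    # Same pixels, same viewpoint, different times: the difference is motion.
    flow = p1_t2 - p1_t1
    magnitude = np.linalg.norm(flow, axis=-1)
    # Static points have (near) zero flow even if the camera moved,
    # because both maps share one reference frame.
    mask = magnitude > threshold
    return flow, mask

# Toy scene: a flat wall at depth 2 with a small patch moving down by 0.3.
H, W = 4, 6
p1_t1 = np.zeros((H, W, 3))
p1_t1[..., 2] = 2.0
p1_t2 = p1_t1.copy()
p1_t2[1:3, 2:4, 1] -= 0.3

flow, mask = scene_flow_and_motion_mask(p1_t1, p1_t2)
print(mask.astype(int))
```

The choice of threshold depends on the scale of the predicted point maps, which this sketch leaves unspecified.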

Paper Structure

This paper contains 38 sections, 13 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: We introduce the concept of Dynamic Point Maps (DPM). Unlike previous extensions of point maps to dynamic scenes, DPMs are time-invariant in addition to being viewpoint-invariant. Because of this, by predicting DPMs from a pair of images in a feed-forward manner, we can easily solve key 3D tasks, such as recovering the camera parameters and reconstructing shape, as well as 4D ones, such as estimating scene flow and 3D object tracking.
  • Figure 2: We propose to extend DUSt3R to predict Dynamic Point Maps (DPM). Each image in the pair is mapped to two point maps that correspond to the timestamps of the two images (pairs share the same colour in the figure). All point maps are defined in the reference frame of image $I_1$, undoing the effect of viewpoint change. Scene flow and space-time correspondences can be inferred immediately.
  • Figure 3: Left: Standard point maps applied to dynamic scenes, as in MonST3R [zhang24monst3r], fail to represent dynamics. The cylinder, which is moving downwards, breaks invariance when the point maps are overlaid. Right: Our Dynamic Point Maps correctly represent dynamics by also controlling time in addition to viewpoint. They restore invariance while still representing the motion of the cylinder.
  • Figure 4: Left: A schematic representation of the point cloud predictors $P_i(t_j,\pi_1)$ extracted from four images $I_1,\dots,I_4$, color-coded as before. Each circle represents an image, visualized as a point on a fictitious viewpoint-time plane. Each arrow corresponds to a point map, the base of which is the source image and the tip of which is the reference viewpoint and time. Right: Visualization of the eight predicted point maps. From bottom to top, we reconstruct the animated point cloud in the reference frame of the first image $\pi_1$. In the bottom-right corner, each image contributes an invariant point map by undoing both the viewpoint and time changes; these point maps can then be fused.
  • Figure 5: Motion segmentation: using the dynamic point cloud predicted by the network from a pair of images, we can segment out the dynamic elements of the scene despite the camera motion.
  • ...and 4 more figures
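
The captions for Figures 2 and 4 note that point maps sharing the same reference viewpoint and time are invariant, so corresponding pixels from different images land on the same 3D point. One way this could be used for correspondence is nearest-neighbour matching in 3D, sketched below. This is an illustrative assumption rather than the procedure described in the paper; the function name `match_pixels` and the brute-force search are hypothetical choices suitable only for small inputs.

```python
import numpy as np

def match_pixels(p1_ref, p2_ref):
    """Match each pixel of image 1 to a pixel of image 2.

    p1_ref, p2_ref: (H, W, 3) point maps for images 1 and 2, both expressed
    at the same reference time and in the same reference camera frame.
    Returns an (H, W, 2) integer array of (row, col) indices into image 2.
    """
    h, w, _ = p1_ref.shape
    h2, w2, _ = p2_ref.shape
    a = p1_ref.reshape(-1, 3)
    b = p2_ref.reshape(-1, 3)
    # Brute-force squared distances between all point pairs: (H*W, H2*W2).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    rows, cols = np.unravel_index(nearest, (h2, w2))
    return np.stack([rows, cols], axis=-1).reshape(h, w, 2)

# Toy data: image 2 sees the same points as image 1, shifted by one column.
H, W = 3, 5
rows, cols = np.mgrid[0:H, 0:W]
p1_ref = np.stack([cols, rows, np.ones_like(rows)], axis=-1).astype(float)
p2_ref = np.roll(p1_ref, 1, axis=1)  # p2_ref[r, c] == p1_ref[r, (c - 1) % W]

matches = match_pixels(p1_ref, p2_ref)
print(matches[0, 0])
```

For realistic image sizes, a spatial index such as a k-d tree would replace the quadratic distance matrix, and a reciprocal check could filter unreliable matches.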