4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction

Kirill Mazur, Marwan Taher, Andrew J. Davison

TL;DR

4D Primitive-Mâché addresses persistent 4D reconstruction from casual monocular video by decomposing scenes into moving rigid primitives and gluing their trajectories over time via dense 2D-3D correspondences. The method introduces a primitive-based motion parameterization using SE(3) poses, a front-end for geometry, segmentation, and correspondences, and a back-end Gauss-Newton optimization with time remapping to produce complete 4D reconstructions. It demonstrates superior accuracy and completeness on object-scanning and multi-object datasets, and showcases object permanence by inferring motion of occluded elements. This approach reduces dynamic reconstruction dimensionality while enabling replayable, temporally-consistent 4D scenes suitable for robotics and AR applications.

Abstract

We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps. Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity. The resulting system enables 4D spatio-temporal awareness, offering capabilities such as replayable 3D reconstructions of articulated objects through time, multi-object scanning, and object permanence. On object scanning and multi-object datasets, our system significantly outperforms existing methods both quantitatively and qualitatively.
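To make the idea of rigid primitives "moving through time" concrete, the following is a minimal NumPy sketch of warping one primitive's points from one timestamp to another using per-timestamp SE(3) poses. It is an illustration under stated assumptions, not the authors' implementation: the helper names `se3` and `warp_points`, the convention that a pose maps a primitive's local frame to the world frame, and the toy poses are all hypothetical.

```python
import numpy as np

def se3(rotation_z_rad, translation):
    """Build a 4x4 rigid transform: rotation about the z-axis, then translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def warp_points(points, pose_src, pose_dst):
    """Move world-space points observed at the source time to the target time.

    Assumption: each pose maps the primitive's local frame to the world frame,
    so the relative motion is pose_dst @ inv(pose_src).
    """
    relative = pose_dst @ np.linalg.inv(pose_src)
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    return (homogeneous @ relative.T)[:, :3]

# Hypothetical primitive: two points seen at time 0, with poses at times 0 and 1.
pose_t0 = se3(0.0, [0.0, 0.0, 0.0])
pose_t1 = se3(np.pi / 2, [1.0, 0.0, 0.0])
points_t0 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
points_t1 = warp_points(points_t0, pose_t0, pose_t1)
```

Because every observation of a primitive can be warped to any timestamp in this way, points seen only once can still be placed in every other frame, which is what makes a reconstruction replayable.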

Paper Structure

This paper contains 31 sections, 13 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Our method (4DPM) takes in casual monocular videos (captured by an iPhone) and outputs complete 3D scene reconstructions at every observed timestamp, using all scene observations. The method takes in the outputs of a feedforward reconstruction model (top row) and glues dynamic geometry observations across time (middle row). This results in a complete and accurate geometric reconstruction, which re-uses observations from all timestamps (bottom row).
  • Figure 2: 4D reconstruction with 4DPM. (left) Our frontend takes in a monocular RGB video and splits it into a set of 3D primitives. Each primitive is represented as a 3D point map in the world coordinate space, cut out by a segmentation mask. These primitives are matched across time (visualised with consistent colours) to form consistent entities across time, to which we refer as objects. (top right) Given geometric observations positioned at their respective timestamps, we "glue" primitives belonging to the same object across time according to their estimated dense 2D correspondences. (bottom right) The resulting complete reconstruction can be replayed across all observed timestamps.
  • Figure 3: Static vs dynamic segmentation. We visualise all estimated primitives on the left. Before motion estimation, we freeze primitives with insufficiently high correspondence residuals, assuming they are static. On the right, only dynamic primitives are shown. Our system produces motion segmentation as a by-product.
  • Figure 4: Qualitative comparison on the Multi-Object dataset. Input video frames are shown in purple. Below each video, we visualise all observed point-maps time-warped to the latest timestamp. Our system successfully handles multi-object motion and performs well on particularly challenging objects such as the spinning ball and robot gripper (top row). We provide a top-down view of multiple objects spinning on a rotating base (bottom row). Our method correctly aggregates all observations, resulting in complete and accurate object scans.
  • Figure 5: Object permanence capabilities. (top row) Input frames of a drawer-closing sequence. (bottom row) The resulting reconstruction estimated with 4DPM, shown from a top-down view. When the drawer is fully closed (rightmost column), our method still reconstructs the drawer body and the objects inside it, despite their being completely occluded. This showcases the object permanence capabilities of 4DPM. The top of the drawer is removed from the reconstruction for better viewing.
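
The static/dynamic split described in the Figure 3 caption can be sketched as a residual test: if a primitive's points, left unmoved, already explain their 2D correspondences in another frame, the primitive is treated as static and frozen. The snippet below is a hedged illustration only. The pinhole model, the function names `project` and `is_static`, the assumption that points are already expressed in the target camera frame, and the 2-pixel threshold are all hypothetical and are not taken from the paper.

```python
import numpy as np

def project(points_cam, focal, center):
    """Pinhole projection of camera-frame 3D points to pixel coordinates."""
    return focal * points_cam[:, :2] / points_cam[:, 2:3] + center

def is_static(points_cam, matched_pixels, focal, center, threshold_px=2.0):
    """Return True if the mean reprojection residual under a no-motion
    hypothesis is below the (hypothetical) pixel threshold."""
    predicted = project(points_cam, focal, center)
    residuals = np.linalg.norm(predicted - matched_pixels, axis=1)
    return bool(residuals.mean() < threshold_px)

# Toy primitive with two points, already in the target camera frame.
focal, center = 100.0, np.array([50.0, 50.0])
points = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]])

# Correspondences that agree with the unmoved points: static.
static_matches = np.array([[50.0, 50.0], [100.0, 50.0]])
# Correspondences shifted by 10 pixels: the primitive must have moved.
moving_matches = static_matches + np.array([10.0, 0.0])

static_flag = is_static(points, static_matches, focal, center)
moving_flag = is_static(points, moving_matches, focal, center)
```

Freezing primitives that pass this kind of test removes their poses from the optimisation, and the remaining set of primitives gives motion segmentation as a by-product.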