Inferring Compositional 4D Scenes without Ever Seeing One

Ahmet Berke Gokmen, Ajad Chhatkuli, Luc Van Gool, Danda Pani Paudel

TL;DR

COM4D reconstructs compositional 4D scenes from monocular video without explicit 4D supervision. It decouples spatial and temporal reasoning during training (Attention Parsing) and fuses them at inference (Attention Mixing). A single Diffusion Transformer is trained on two data sources, one for static multi-object composition and one for single-object dynamics, with Diffusion Forcing added for temporal coherence. The framework achieves state-of-the-art results in both compositional 4D reconstruction and 4D single-object reconstruction, while delivering strong 3D scene reconstruction performance, all without test-time optimization. Limitations include the lack of explicit physical causality and a restriction to fixed-camera setups; potential extensions include dynamic cameras and occlusion-aware physics.

Abstract

Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in the wild, though extremely interesting, is equally hard. Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single-object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, and single-object dynamics throughout the video on the other, thus completely avoiding reliance on 4D compositional training data. At inference time, our proposed attention mixing mechanism combines these independently learned attentions, without requiring any 4D composition examples. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos. Furthermore, COM4D provides state-of-the-art results in existing separate problems of 4D object and composed 3D reconstruction despite being purely data-driven.

Paper Structure

This paper contains 31 sections, 5 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Given a single video (bottom), our method reconstructs the entire 3D scene along with the individual dynamic objects (A), while maintaining spatial and temporal consistency through spatio-temporal attention mixing (C). The silhouettes (purple for human and orange for dog) correspond to the beginning of dynamic sequences. Our user study (for spatial correctness and temporal coherence) shows that the reconstructions obtained using the proposed attention mixing mechanism are clearly preferred over the baseline without mixing (B).
  • Figure 2: Our attention parsing and mixing strategy. A single DiT model with shared weights is trained jointly on two datasets. (Top) During training with samples from DeformingThings [li20214dcomplete], odd-indexed blocks perform multi-frame attention to capture temporal dynamics. (Bottom) When training with samples from 3D-FRONT [fu20213dfront3dfurnishedrooms], even-indexed blocks perform multi-instance attention to model spatial part decomposition. At inference, the same model applies an attention mixing mechanism. In each layer, spatial blocks (even-indexed) aggregate all latents from a single frame and process them jointly, conditioned on the full-scene image $y$ at that timestep. Temporal blocks (odd-indexed) then operate over all frames of each dynamic object separately, conditioned on their corresponding masked images. Masks are extracted from the video for each dynamic object using SAM [sam], enabling temporally consistent object-specific reasoning.
  • Figure 3: Qualitative results on temporal sequences. Top rows show input frames; bottom rows show our generated reconstructions from two vertically stacked camera views. The examples shown are from content produced by ChatGPT [chatgpt] and animated with Wan [wan], the CMU Panoptic dataset [panoptic] (sequences 160401_ian3 and 160906_ian2), and the PROX dataset [prox] (N3OpenArea_00158_02). Our method maintains temporal consistency and spatial realism across both real and synthetic sources.
  • Figure 4: Visualizations with and without our Attention Mixing strategy. Results are for 160401_ian3 at frame 1180 (starting frame: 1100) and 170915_office1 at frame 670 (starting frame: 590). Gray points denote ground truth.
  • Figure 5: Qualitative 4D generation comparisons for two subjects (top two rows: Ninja, bottom two rows: Amy) at two time steps. The first column shows the input frames, and subsequent columns show a fixed-pose rendered view from each method. V2M4 fails on a few samples, e.g., the input in the last row.
  • ...and 12 more figures
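
The alternating scheme described in the Figure 2 caption (even-indexed blocks attend across all object latents within a frame, odd-indexed blocks attend across all frames of each dynamic object) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the `(frames, objects, tokens, dim)` latent layout, the weight-free single-head attention, and the choice to let static objects skip temporal blocks are all assumptions made for illustration. Image conditioning, SAM masks, and the diffusion process itself are left out.

```python
import numpy as np


def self_attention(x):
    """Single-head self-attention without learned weights (assumption).

    x has shape (batch, tokens, dim); tokens only attend within their batch row.
    """
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


def spatial_block(z):
    """Even-indexed block: all object latents of one frame attend jointly."""
    T, N, L, D = z.shape
    out = self_attention(z.reshape(T, N * L, D))  # one batch row per frame
    return out.reshape(T, N, L, D)


def temporal_block(z, dynamic):
    """Odd-indexed block: each dynamic object attends over all of its frames."""
    T, N, L, D = z.shape
    per_object = z.transpose(1, 0, 2, 3).reshape(N, T * L, D)  # one row per object
    mixed = self_attention(per_object).reshape(N, T, L, D).transpose(1, 0, 2, 3)
    # Static objects receive a zero update in this sketch (assumption).
    return np.where(dynamic[None, :, None, None], mixed, 0.0)


def mix_attention(z, dynamic, num_blocks=4):
    """Alternate spatial (even) and temporal (odd) blocks with residual adds."""
    for i in range(num_blocks):
        update = spatial_block(z) if i % 2 == 0 else temporal_block(z, dynamic)
        z = z + update
    return z


rng = np.random.default_rng(0)
latents = rng.normal(size=(3, 2, 4, 8))  # 3 frames, 2 objects, 4 tokens, 8 dims
dynamic = np.array([True, False])        # object 0 moves, object 1 is static
out = mix_attention(latents, dynamic)
print(out.shape)  # (3, 2, 4, 8)
```

The point of the sketch is the reshaping: the same latent tensor is grouped by frame for spatial blocks and by object for temporal blocks, so information can flow along both axes even though no single block sees the full 4D scene.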