ShapeGen4D: Towards High Quality 4D Shape Generation from Videos

Jiraphon Yenphraphai, Ashkan Mirzaei, Jianqi Chen, Jiaxu Zou, Sergey Tulyakov, Raymond A. Yeh, Peter Wonka, Chaoyang Wang

TL;DR

ShapeGen4D addresses video-conditioned 4D shape generation by directly producing dynamic 3D meshes from monocular video in a feedforward manner. It combines temporally-aligned latents, a spatiotemporal diffusion transformer, and cross-frame noise sharing to enforce temporal coherence and accommodate topology changes without per-frame optimization. The approach, grounded in pretrained 3D generators, achieves improved geometric fidelity and rendering stability against strong baselines on Objaverse and Consistent4D datasets, with ablations verifying the importance of its design choices. A two-stage post-processing pipeline for global pose registration and topology-consistent texturization enables animatable, consistently textured 4D assets practical for downstream applications.

Abstract

Video-conditioned 4D shape generation aims to recover time-varying 3D geometry and view-consistent appearance directly from an input video. In this work, we introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from the video. Our framework introduces three key components based on large-scale pre-trained 3D models: (i) a temporal attention that conditions generation on all frames while producing a time-indexed dynamic representation; (ii) a time-aware point sampling and 4D latent anchoring that promote temporally consistent geometry and texture; and (iii) noise sharing across frames to enhance temporal stability. Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization. Across diverse in-the-wild videos, our method improves robustness and perceptual fidelity and reduces failure modes compared with the baselines.
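Component (iii), noise sharing across frames, can be illustrated with a minimal sketch. It assumes the simplest reading of the idea: a single Gaussian sample is drawn once and reused as the initial latent noise for every frame, so that per-frame denoising starts from the same point. The function names, tensor shapes, and the use of NumPy are assumptions made for illustration; the text here does not specify whether the method shares the noise fully or only in part.

```python
import numpy as np


def shared_initial_noise(num_frames, num_tokens, latent_dim, seed=0):
    """Draw one Gaussian noise sample and reuse it for every frame.

    Hypothetical helper: the (frames, tokens, dim) layout is an assumption.
    """
    rng = np.random.default_rng(seed)
    base = rng.standard_normal((num_tokens, latent_dim))
    # Repeat the same (tokens, dim) noise along a new leading frame axis.
    return np.broadcast_to(base, (num_frames, num_tokens, latent_dim)).copy()


def independent_initial_noise(num_frames, num_tokens, latent_dim, seed=0):
    """Baseline for comparison: every frame gets its own noise sample."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_frames, num_tokens, latent_dim))


shared = shared_initial_noise(num_frames=8, num_tokens=16, latent_dim=4)
independent = independent_initial_noise(num_frames=8, num_tokens=16, latent_dim=4)

# With sharing, the frame-to-frame difference of the starting noise is zero.
print(np.abs(shared[1:] - shared[:-1]).max())
# Without sharing, neighboring frames start from unrelated noise.
print(np.abs(independent[1:] - independent[:-1]).max())
```

Starting all frames from the same noise removes one source of frame-to-frame variation before denoising begins, which is consistent with the stated goal of enhancing temporal stability.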

Paper Structure

This paper contains 14 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: ShapeGen4D generates high-quality mesh sequences from input monocular videos.
  • Figure 2: ShapeGen4D employs a flow-based latent diffusion transformer to generate a sequence of meshes from an input video. (a) A 3D VAE encodes shapes into latents by cross-attending subsampled query points with a dense point cloud. To encode a sequence of animated assets, query points are subsampled from the first-frame point cloud and then propagated through the animation to obtain query points for subsequent frames—yielding temporally-aligned latents. The decoder maps these latents to signed distance fields, which are then converted into meshes via marching cubes. (b) The spatiotemporal diffusion transformer interleaves frozen dual/single-stream transformer blocks from the base 3D generative model, which process hidden states for each frame independently, with learnable spatiotemporal attention layers that capture cross-frame dependencies and enforce temporal consistency in the denoised latents.
  • Figure 3: Illustration of latents with and without aligning query points across frames in (a) and (b). In (c), we visualize the average normalized $L_2$ difference between latents at the closest 3D positions across neighboring frames. We observe that with alignment, the $L_2$ difference is smaller, indicating that the latents are more consistent, i.e., less jittery compared to non-aligned latents.
  • Figure 4: Qualitative comparison of noise sharing. The base 3D model generates object shapes in arbitrary orientations agnostic to the input image viewpoint, often causing pose changes across a sequence (e.g., the hippo in the first row). We observe that sharing noise across frames reduces flickering and further improves shape quality in challenging cases such as the flag example.
  • Figure 5: Qualitative comparison with baselines on the held-out Objaverse test set.
  • ...and 1 more figure
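The temporally-aligned query points described in the Figure 2(a) caption can be sketched as follows. The sketch assumes the animated asset is given as a point sequence with per-point correspondence (point `i` in every frame is the same surface point), so that "propagating" first-frame queries through the animation reduces to reusing the same indices in each frame. Uniform random subsampling is used here only for brevity; the caption does not state which subsampling scheme is used, and the function name and array layout are hypothetical.

```python
import numpy as np


def sample_aligned_queries(point_seq, num_queries, seed=0):
    """Subsample query points on the first frame and track them over time.

    point_seq: array of shape (T, N, 3); point i is assumed to be the same
    surface point in every frame (an assumption of this sketch).
    Returns the queries with shape (T, num_queries, 3) and the chosen indices.
    """
    rng = np.random.default_rng(seed)
    num_points = point_seq.shape[1]
    # Choose indices once, on the first frame (uniform sampling is an assumption).
    idx = rng.choice(num_points, size=num_queries, replace=False)
    # Reusing the same indices in every frame propagates the queries.
    return point_seq[:, idx, :], idx


# Toy animation: a random point cloud translating along x over 4 frames.
rng = np.random.default_rng(1)
first_frame = rng.uniform(-1.0, 1.0, size=(1000, 3))
offsets = np.linspace(0.0, 0.5, 4)[:, None, None] * np.array([1.0, 0.0, 0.0])
point_seq = first_frame[None] + offsets  # shape (4, 1000, 3)

queries, idx = sample_aligned_queries(point_seq, num_queries=64)

# Each query follows its surface point, so its displacement equals the motion.
print(queries.shape)
print(queries[3, 0] - queries[0, 0])
```

Because every latent token is tied to the same surface point over time, differences between neighboring-frame latents reflect actual motion rather than resampling, which matches the smaller $L_2$ difference reported for aligned latents in Figure 3.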