ShapeGen4D: Towards High Quality 4D Shape Generation from Videos
Jiraphon Yenphraphai, Ashkan Mirzaei, Jianqi Chen, Jiaxu Zou, Sergey Tulyakov, Raymond A. Yeh, Peter Wonka, Chaoyang Wang
TL;DR
ShapeGen4D addresses video-conditioned 4D shape generation by producing dynamic 3D meshes directly from monocular video in a feedforward manner. It combines temporally aligned latents, a spatiotemporal diffusion transformer, and cross-frame noise sharing to enforce temporal coherence and accommodate topology changes without per-frame optimization. Built on pretrained 3D generators, the approach improves geometric fidelity and rendering stability over strong baselines on the Objaverse and Consistent4D datasets, and ablations confirm the importance of its design choices. A two-stage post-processing pipeline for global pose registration and topology-consistent texturization yields animatable, consistently textured 4D assets suitable for downstream applications.
Abstract
Video-conditioned 4D shape generation aims to recover time-varying 3D geometry and view-consistent appearance directly from an input video. In this work, we introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from the video. Our framework introduces three key components built on large-scale pretrained 3D models: (i) a temporal attention mechanism that conditions generation on all frames while producing a time-indexed dynamic representation; (ii) time-aware point sampling and 4D latent anchoring, which promote temporally consistent geometry and texture; and (iii) noise sharing across frames to enhance temporal stability. Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization. Across diverse in-the-wild videos, our method improves robustness and perceptual fidelity and reduces failure modes compared with the baselines.
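To make component (iii) concrete, here is a minimal sketch of one plausible reading of noise sharing across frames: a single Gaussian noise tensor is drawn once and reused as the starting latent for every frame, rather than drawing a fresh sample per frame. The function names, the token and channel dimensions, and the choice to share the noise exactly (with no per-frame perturbation) are assumptions made for illustration; the abstract does not specify how the sharing is implemented.

```python
import random


def sample_noise(num_tokens, dim, rng):
    """Draw one latent-shaped block of standard Gaussian noise."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]


def shared_frame_noise(num_frames, num_tokens, dim, seed=0):
    """Hypothetical helper: every frame starts from the same noise sample."""
    rng = random.Random(seed)
    base = sample_noise(num_tokens, dim, rng)
    # Copy the base sample per frame so frames can be denoised independently later.
    return [[row[:] for row in base] for _ in range(num_frames)]


def independent_frame_noise(num_frames, num_tokens, dim, seed=0):
    """Baseline for comparison: each frame draws its own noise sample."""
    rng = random.Random(seed)
    return [sample_noise(num_tokens, dim, rng) for _ in range(num_frames)]


# Illustrative sizes only (4 frames, 8 latent tokens, 16 channels).
shared = shared_frame_noise(num_frames=4, num_tokens=8, dim=16)
independent = independent_frame_noise(num_frames=4, num_tokens=8, dim=16)
```

In this sketch, identical initial noise means that differences between frames after denoising come only from the per-frame conditioning, which is one intuitive reason shared noise could reduce frame-to-frame jitter.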
