Learning Coherent Matrixized Representation in Latent Space for Volumetric 4D Generation

Qitong Yang, Mingtao Feng, Zijie Wu, Shijie Sun, Weisheng Dong, Yaonan Wang, Ajmal Mian

TL;DR

This work presents a framework for volumetric 4D sequence generation that combines coherent 3D shape and color modeling, a matrixized latent representation, and spatio-temporal diffusion conditioned on image and text. The matrixized latent space makes learning efficient and supports variable-length sequences via latent frame interpolation, while the Hierarchical Conditional Spatio-Temporal Attention (HCSTA) block ensures shape-color coherence and temporal consistency without relying on pose priors. The approach improves over current methods on multiple datasets for both unconditional and conditional generation, offering high-fidelity geometry, color, and motion with efficient inference. The results suggest practical potential for editable, view-consistent 4D content in graphics and vision applications, though the limited availability of real-world data remains a challenge.

Abstract

Directly learning to model 4D content, including shape, color, and motion, is challenging. Existing methods rely on pose priors for motion control, resulting in limited motion diversity and continuity in details. To address this, we propose a framework that generates volumetric 4D sequences, where 3D shapes are animated under given conditions (text-image guidance) with dynamic evolution in shape and color across spatial and temporal dimensions, allowing for free navigation and rendering from any direction. We first use coherent 3D shape and color modeling to encode the shape and color of each detailed 3D geometry frame into a latent space. We then propose a matrixized 4D sequence representation that allows the diffusion model to operate efficiently. Finally, we introduce spatio-temporal diffusion for 4D volumetric generation conditioned on given images and text prompts. Extensive experiments on the ShapeNet, 3DBiCar, DeformingThings4D and Objaverse datasets across several tasks demonstrate that our method effectively learns to generate high-quality 3D shapes with consistent color and coherent mesh animations, improving over current methods. Our code will be publicly available.

Paper Structure (11 sections, 21 equations, 15 figures, 8 tables)


Figures (15)

  • Figure 1: Proposed image-text conditioned 4D generation with high 3D shape quality, color fidelity and sequence coherence. Each frame enables free navigation and rendering from any direction.
  • Figure 1: Ablation of hyperparameter.
  • Figure 2: Method overview. The shape and color latent vectors for full sequences are jointly concatenated into a matrixized 4D latent representation $\mathcal{M}$. The input masked image and text are encoded via CLIP to condition the diffusion process of $\mathcal{M}$. The volumetric 4D sequences are then reconstructed from the generated $\mathcal{M}'$, with latent frame interpolation enabling variable-length sequence generation.
  • Figure 3: Pipeline for learning geometric coherent color representation. During each training epoch, a subset of vertices is randomly selected.
  • Figure 4: Hierarchical Conditional Spatio-Temporal Attention (HCSTA) block, applied repeatedly to obtain the final denoised $\mathcal{M}$. Within each layer, latents of different colors represent the dynamics of distinct local regions, while latents of the same color represent the dynamics of one local region at different time steps.
  • ...and 10 more figures
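
The matrixized representation and latent frame interpolation described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the layout of $\mathcal{M}$ (one row per frame, shape and color latents concatenated along the feature axis), the latent sizes, the use of linear interpolation, and the hypothetical function names `matrixize` and `interpolate_frames`.

```python
import numpy as np


def matrixize(shape_latents, color_latents):
    """Concatenate per-frame shape and color latents into one matrix M.

    Assumed layout (not from the paper): M has one row per frame and
    shape (T, Ds + Dc), where Ds and Dc are the latent sizes.
    """
    shape_latents = np.asarray(shape_latents, dtype=np.float64)
    color_latents = np.asarray(color_latents, dtype=np.float64)
    if shape_latents.shape[0] != color_latents.shape[0]:
        raise ValueError("shape and color latents must cover the same frames")
    return np.concatenate([shape_latents, color_latents], axis=1)


def interpolate_frames(M, num_frames):
    """Resample the rows (frames) of M to num_frames rows.

    Linear interpolation between neighbouring frames is an assumption;
    the caption only states that latent frames are interpolated.
    """
    M = np.asarray(M, dtype=np.float64)
    T = M.shape[0]
    if T < 2 or num_frames < 2:
        raise ValueError("need at least two source and two target frames")
    src = np.linspace(0.0, T - 1, num_frames)  # fractional source indices
    lo = np.floor(src).astype(int)             # left neighbour frame
    hi = np.minimum(lo + 1, T - 1)             # right neighbour, clamped
    w = (src - lo)[:, None]                    # blend weight per target frame
    return (1.0 - w) * M[lo] + w * M[hi]


# Illustrative sizes only: 4 frames, 3-dim shape latent, 2-dim color latent.
rng = np.random.default_rng(0)
M = matrixize(rng.normal(size=(4, 3)), rng.normal(size=(4, 2)))
M_long = interpolate_frames(M, 7)  # 4 frames resampled to 7
```

In this sketch the first and last frames are preserved and every new frame is a blend of its two temporal neighbours, so the sequence length can be changed after generation without rerunning the diffusion model.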