CT4D: Consistent Text-to-4D Generation with Animatable Meshes
Ce Chen, Shaoli Huang, Xuelin Chen, Guangyi Chen, Xiaoguang Han, Kun Zhang, Mingming Gong
TL;DR
CT4D is a mesh-based approach to text-to-4D generation that directly yields animatable triangle meshes. Its Generate-Refine-Animate pipeline, combined with region-based driving and an ARAP rigidity constraint, achieves better interframe consistency and geometry preservation than prior NeRF- and 3DGS-based methods. The explicit mesh representation also enables texture editing and multi-object composition, which broadens its use in content creation. In quantitative metrics and user studies, CT4D outperforms baselines such as AYG and 4D-fy on key quality dimensions, and the paper demonstrates applications and states its limitations. The work advances text-to-4D generation by combining explicit geometry control with diffusion-based optimization; future work should address joint-aware generation, dynamic object emergence, and standardized evaluation metrics.
Abstract
Text-to-4D generation has recently been shown to be viable by integrating a 2D image diffusion model with a video diffusion model. However, existing models tend to produce results with inconsistent motions and geometric structures over time. To address this, we present a novel framework, coined CT4D, which operates directly on animatable meshes to generate consistent 4D content from arbitrary user-supplied prompts. The primary challenges of our mesh-based framework are stably generating a mesh whose details align with the text prompt, and driving that mesh directly while maintaining surface continuity. Our CT4D framework incorporates a Generate-Refine-Animate (GRA) algorithm to improve the creation of text-aligned meshes. To improve surface continuity, we divide a mesh into several smaller regions and apply a uniform driving function within each region. Additionally, we constrain the animating stage with a rigidity regularization to ensure cross-region continuity. Our experimental results, both qualitative and quantitative, demonstrate that CT4D surpasses existing text-to-4D techniques in maintaining interframe consistency and preserving global geometry. Furthermore, we show that this mesh representation inherently supports combinational 4D generation and texture editing.
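To make the region-based driving and the rigidity constraint more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each region is driven by a single rigid transform (a rotation plus a translation) and uses an ARAP-style edge penalty with the region rotation standing in for the per-vertex rotation. The function names `drive_regions` and `rigidity_energy`, the fixed region assignment, and the toy four-vertex strip are all hypothetical; in CT4D the driving function is optimized under diffusion guidance rather than set by hand.

```python
import numpy as np

def rotation_z(angle):
    """3x3 rotation about the z axis (used only to build the toy example)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def drive_regions(vertices, region_ids, rotations, translations):
    """Apply one rigid transform per region: v' = R_k v + t_k for each vertex in region k."""
    R = rotations[region_ids]       # (V, 3, 3), one rotation per vertex via its region
    t = translations[region_ids]    # (V, 3)
    return np.einsum("vij,vj->vi", R, vertices) + t

def rigidity_energy(rest, deformed, edges, region_ids, rotations):
    """ARAP-style penalty: sum over edges (i, j) of ||(p'_i - p'_j) - R_i (p_i - p_j)||^2."""
    i, j = edges[:, 0], edges[:, 1]
    rest_edge = rest[i] - rest[j]
    deformed_edge = deformed[i] - deformed[j]
    R_i = rotations[region_ids[i]]  # rotation of the region that owns vertex i
    predicted = np.einsum("eij,ej->ei", R_i, rest_edge)
    return float(np.sum((deformed_edge - predicted) ** 2))

# Toy example (assumption): a strip of four vertices split into two regions.
rest = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0], [3.0, 0, 0]])
edges = np.array([[0, 1], [1, 2], [2, 3]])
region_ids = np.array([0, 0, 1, 1])

# Same transform in both regions: the strip moves rigidly as a whole.
same_R = np.stack([np.eye(3), np.eye(3)])
zero_t = np.zeros((2, 3))
energy_same = rigidity_energy(
    rest, drive_regions(rest, region_ids, same_R, zero_t), edges, region_ids, same_R
)

# Region 1 rotated by 90 degrees: the edge crossing the region boundary is stretched.
mixed_R = np.stack([np.eye(3), rotation_z(np.pi / 2)])
energy_mixed = rigidity_energy(
    rest, drive_regions(rest, region_ids, mixed_R, zero_t), edges, region_ids, mixed_R
)

print(energy_same, energy_mixed)
```

In this sketch, edges inside a region contribute nothing to the energy because both endpoints share one transform, so the penalty comes entirely from edges that cross region boundaries. This is one way to read the abstract's claim that the rigidity term enforces cross-region continuity.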
