CT4D: Consistent Text-to-4D Generation with Animatable Meshes

Ce Chen; Shaoli Huang; Xuelin Chen; Guangyi Chen; Xiaoguang Han; Kun Zhang; Mingming Gong

TL;DR

CT4D is a mesh-based approach to text-to-4D generation that directly yields animatable triangle meshes. Its Generate-Refine-Animate pipeline, combined with region-based driving and ARAP rigidity, achieves better interframe consistency and geometry preservation than prior NeRF- and 3DGS-based methods. The explicit mesh representation also enables texture editing and multi-object composition, which broadens its use in content creation. In quantitative metrics and user studies, CT4D outperforms baselines such as AYG and 4D-fy on key quality dimensions, and the paper demonstrates applications and states its limitations openly. The work advances text-to-4D generation by integrating explicit geometry control with diffusion-based optimization; future work should address joint-aware generation, dynamic object emergence, and standardized evaluation metrics.

Abstract

Text-to-4D generation has recently been shown to be viable by integrating a 2D image diffusion model with a video diffusion model. However, existing models tend to produce results with inconsistent motions and geometric structures over time. To address this, we present a novel framework, coined CT4D, which operates directly on animatable meshes to generate consistent 4D content from arbitrary user-supplied prompts. The primary challenges of our mesh-based framework are stably generating a mesh whose details align with the text prompt, and driving that mesh directly while maintaining surface continuity. Our CT4D framework incorporates a Generate-Refine-Animate (GRA) algorithm to improve the creation of text-aligned meshes. To improve surface continuity, we divide a mesh into several smaller regions and apply a uniform driving function within each region. Additionally, we constrain the animating stage with a rigidity regulation to ensure cross-region continuity. Our experimental results, both qualitative and quantitative, demonstrate that CT4D surpasses existing text-to-4D techniques in maintaining interframe consistency and preserving global geometry. Furthermore, we show that this representation inherently supports combinational 4D generation and texture editing.
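The region-based driving described in the abstract can be illustrated with a minimal sketch. It assumes a hard nearest-handle assignment of vertices to regions and one rigid transform (rotation about the handle point plus a translation) per region. The function names `assign_regions`, `drive_mesh`, and `rotation_z` are hypothetical, and the paper's actual clustering and pseudo-skinning weights are not reproduced here.

```python
import numpy as np

def rotation_z(angle_rad):
    """Rotation matrix about the z axis (illustrative helper)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def assign_regions(vertices, handles):
    """Assumption: each vertex belongs to the region of its nearest handle point."""
    dists = np.linalg.norm(vertices[:, None, :] - handles[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def drive_mesh(vertices, handles, labels, rotations, translations):
    """Apply one rigid transform per region, rotating about that region's handle."""
    driven = np.empty_like(vertices)
    for k in range(len(handles)):
        mask = labels == k
        local = vertices[mask] - handles[k]
        driven[mask] = local @ rotations[k].T + handles[k] + translations[k]
    return driven

# Toy "mesh": four vertices on a line, split into two regions.
vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                     [4.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
handles = np.array([[0.5, 0.0, 0.0], [4.5, 0.0, 0.0]])
labels = assign_regions(vertices, handles)

# Region 0 stays fixed; region 1 rotates 90 degrees about z and shifts up.
rotations = np.stack([np.eye(3), rotation_z(np.pi / 2)])
translations = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
driven = drive_mesh(vertices, handles, labels, rotations, translations)
```

Because every vertex in a region shares one transform, distances inside a region are preserved exactly. Discontinuities can appear only at region boundaries, which is where a cross-region rigidity constraint becomes relevant.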

Paper Structure (37 sections, 10 equations, 15 figures, 1 table)

Introduction
Related Work
Text-to-4D Generation.
Skeleton-free Mesh Deformation.
Method
4D Representation
Generate-Refine-Animate (GRA)
Generating.
Refining.
Animating.
Experiments
Evaluation Settings
Metrics.
Baselines and Prompts.
Implementation Details.
...and 22 more sections

Figures (15)

Figure 1: Generated samples from two text prompts viewed at different time steps and viewpoints. Video results are presented in the supplementary.
Figure 2: Limitations of existing text-to-4D methods. The text prompt is "superhero dog with red cape flying through the sky". (a) Front-view outputs of 4D-fy with video SDS loss weights of 0.1 (1st row) and 1.0 (2nd row). The geometry and texture of the dog's ear degrade with the larger weight because the static and dynamic parts are not decoupled. (b) Front-view (1st row) and back-view (2nd row) outputs of AYG. The front view looks fine, while the back view is distorted, with the cape appearing granulated.
Figure 3: CT4D Framework. We generate animatable meshes using the Generate-Refine-Animate (GRA) algorithm. First, multiview images are rendered with a NeRF, and the SDS loss is calculated using a multiview diffusion model, resulting in a coarse static 3D object aligned with the text prompt. Next, a coarse mesh is extracted from the NeRF and sequentially optimized with multiple diffusion models (multiview, normal-depth, and single-view) to refine geometry and texture, producing a high-resolution mesh. Before animating, the refined static mesh is converted to an animatable form through clustering and generating pseudo-skinning weights and handle points. In the third stage, the mesh is driven by translations and rotations of the handle points (HP) and rendered into a sequence of keyframes to calculate the video SDS loss, thereby adding motion. Rigidity regulation is applied periodically to preserve the geometry during the animating stage.
Figure 4: Visual Results of GRA Algorithm. The text prompt used is "an elf playing skateboard". (a) to (c) sequentially depict the image result, the geometry image rendered with the extracted mesh (without texture), and the final mesh image rendered with texture.
Figure 5: Comparison with Existing Methods. We compare our method with AYG and 4D-fy. To illustrate the geometry, we present the mesh rendering results from 4D-fy and our method in the 2nd and 4th rows, respectively. As the source code for AYG is not available, we cannot generate mesh rendering results for this method.
...and 10 more figures
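The Figure 3 caption mentions a rigidity regulation that preserves geometry during animation, and the TL;DR refers to ARAP (as-rigid-as-possible) rigidity. The sketch below shows one common form of an ARAP energy: for each vertex, a best-fit rotation is estimated from its edge vectors, and the residual after applying that rotation is penalized. It assumes uniform edge weights and a simple neighbor list. The names `best_fit_rotation` and `arap_energy` are hypothetical, and the exact formulation used in the paper may differ.

```python
import numpy as np

def best_fit_rotation(rest_edges, deformed_edges):
    """Rotation R minimizing sum ||d - R r||^2 over paired edges (Kabsch via SVD)."""
    cov = rest_edges.T @ deformed_edges
    u, _, vt = np.linalg.svd(cov)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    return vt.T @ np.diag([1.0, 1.0, sign]) @ u.T

def arap_energy(rest, deformed, neighbors):
    """Sum of per-vertex rigid-fit residuals, with uniform edge weights (assumption)."""
    energy = 0.0
    for i, nbrs in enumerate(neighbors):
        r = rest[nbrs] - rest[i]
        d = deformed[nbrs] - deformed[i]
        rot = best_fit_rotation(r, d)
        energy += float(np.sum((d - r @ rot.T) ** 2))
    return energy

# Toy tetrahedron in which every vertex neighbors the other three.
rest = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
neighbors = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]

# A rigid motion (rotation about z plus a translation) should cost nothing.
c, s = np.cos(0.7), np.sin(0.7)
rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
rigid = rest @ rot_z.T + np.array([0.3, -0.2, 1.0])
rigid_energy = arap_energy(rest, rigid, neighbors)

# Uniform scaling is not rigid, so it should be penalized.
scaled_energy = arap_energy(rest, 2.0 * rest, neighbors)
```

An energy of this kind is zero for motions that are locally rigid and grows as neighborhoods stretch or shear, which is why it suits a regularizer intended to keep geometry intact while handle points move.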
