TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing

Eddie Pokming Sheung, Qihao Liu, Wufei Ma, Prakhar Kaushik, Jianwen Xie, Alan Yuille

TL;DR

TriDiff-4D tackles fast, high-fidelity 4D avatar generation from text with a diffusion-based triplane re-posing pipeline that explicitly separates 3D structure from motion. The method first creates a canonical 3D avatar via triplane diffusion, then generates a motion sequence from the text, and finally re-poses the avatar with a second diffusion model conditioned on the skeletal sequence, which promotes temporal coherence and geometric consistency. By learning 3D structure and motion priors and enabling skeleton-driven control, TriDiff-4D generates a sequence in seconds rather than hours and outperforms prior 4D methods in both quality and speed, while supporting NeRF or Gaussian Splatting decoders. This approach reduces reliance on expensive optimization loops and mitigates artifacts such as view inconsistencies and jelly-like wobbling, with applications in VR/AR, gaming, and digital twins.

Abstract

With the increasing demand for 3D animation, generating high-fidelity, controllable 4D avatars from textual descriptions remains a significant challenge. Despite notable efforts in 4D generative modeling, existing methods exhibit fundamental limitations that impede their broader applicability, including temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational costs, and limited control over dynamics. To address these challenges, we propose TriDiff-4D, a novel 4D generative pipeline that employs diffusion-based triplane re-posing to produce high-quality, temporally coherent 4D avatars. Our model adopts an auto-regressive strategy to generate 4D sequences of arbitrary length, synthesizing each 3D frame with a single diffusion process. By explicitly learning 3D structure and motion priors from large-scale 3D and motion datasets, TriDiff-4D enables skeleton-driven 4D generation that excels in temporal consistency, motion accuracy, computational efficiency, and visual fidelity. Specifically, TriDiff-4D first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to the motion sequence, supporting arbitrarily long 4D generation. Experimental results demonstrate that TriDiff-4D significantly outperforms existing methods, reducing generation time from hours to seconds by eliminating the optimization process, while substantially improving the generation of complex motions with high-fidelity appearance and accurate 3D geometry.
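The control flow described in the abstract can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the function names `generate_4d`, `make_avatar`, `make_motion`, and `repose` are hypothetical, the three models are replaced by toy stand-ins, and conditioning each frame on the previous frame is an assumption (the abstract only states that generation is auto-regressive, with one diffusion process per 3D frame).

```python
from typing import Any, Callable, List, Sequence


def generate_4d(
    prompt: str,
    make_avatar: Callable[[str], Any],
    make_motion: Callable[[str], Sequence[Any]],
    repose: Callable[[Any, Any], Any],
) -> List[Any]:
    """Auto-regressive 4D generation sketch (all callables are hypothetical)."""
    canonical = make_avatar(prompt)   # step 1: canonical 3D avatar (triplane)
    skeletons = make_motion(prompt)   # step 2: skeletal motion sequence
    frames: List[Any] = []
    previous = canonical
    for skeleton in skeletons:
        # step 3: one re-posing pass per frame; conditioning on the previous
        # frame (rather than the canonical avatar) is an assumption here.
        current = repose(previous, skeleton)
        frames.append(current)
        previous = current
    return frames


# Toy stand-ins so the sketch runs; real models would return triplane features.
def toy_avatar(prompt: str) -> dict:
    return {"prompt": prompt, "pose": "canonical"}


def toy_motion(prompt: str) -> list:
    return ["step_left", "step_right", "jump"]


def toy_repose(previous: dict, skeleton: str) -> dict:
    return {"prompt": previous["prompt"], "pose": skeleton}


frames = generate_4d("a running robot", toy_avatar, toy_motion, toy_repose)
print([frame["pose"] for frame in frames])
```

Because the loop only needs the previous frame and the next skeleton, the sequence length is bounded by the motion sequence rather than by a fixed window, which is consistent with the abstract's claim of arbitrarily long generation.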

Paper Structure

This paper contains 18 sections, 5 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: TriDiff-4D is a novel 4D generative pipeline that enables high-quality, controllable 4D avatar generation from text using diffusion-based triplane re-posing. By explicitly modeling 3D structure and motion priors within the diffusion model, learned from large-scale 3D and motion datasets, it produces anatomically accurate, motion-consistent, dynamic, and visually coherent avatars, generating a 14-frame sequence of 3D objects in just 36 seconds on a single H100 GPU.
  • Figure 2: Running-motion comparison between the baseline (ren2024l4gm, left) and our method (right). The baseline exhibits unrealistic geometric stretching, particularly evident in limb elongation during dynamic movements, while our method maintains consistent proportions and structural integrity throughout the motion sequence.
  • Figure 3: Method overview. Given a prompt, we generate a 4D mesh through a three-step process: (1) Triplane generation, which transforms text descriptions into 3D representations; (2) Skeleton generation, which generates 3D motion sequences from the text; and (3) Diffusion-based reposing, which integrates these components to produce animated 3D avatars with precise pose control.
  • Figure 4: Visualization of our triplane skeleton encoding approach. Each row shows the same pose projected onto one of three orthogonal planes: XY (top), XZ (middle), and YZ (bottom). For each projection, we display three feature channels, repeated to match the triplane feature dimensionality: occupancy maps, which highlight the structural presence of joints and bones, and index maps, where brightness variations represent normalized joint indices.
  • Figure 5: Complex motion generation with TriDiff-4D. Our model is capable of generating extreme pose transitions and complex motions while preserving consistent geometry and appearance.
  • ...and 6 more figures
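
The skeleton encoding described in the Figure 4 caption can be illustrated with a small rasterizer. This is a sketch under stated assumptions, not the paper's code: joint coordinates are assumed to lie in [-1, 1], the grid resolution, the bone sampling density, and the normalization (i + 1) / J for joint indices are all illustrative choices, and the names `encode_skeleton` and `to_pixel` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
Grid = List[List[float]]

# Axis pairs (horizontal, vertical) for the three orthogonal planes.
PLANES = {"xy": (0, 1), "xz": (0, 2), "yz": (1, 2)}


def to_pixel(value: float, res: int) -> int:
    """Map a coordinate in [-1, 1] (assumed range) to a pixel in [0, res - 1]."""
    clamped = max(-1.0, min(1.0, value))
    return min(res - 1, int((clamped + 1.0) / 2.0 * res))


def encode_skeleton(
    joints: Sequence[Point],
    bones: Sequence[Tuple[int, int]],
    res: int = 16,
    samples: int = 32,
) -> Dict[str, Dict[str, Grid]]:
    """Project a skeleton onto three planes as occupancy and index maps."""
    maps: Dict[str, Dict[str, Grid]] = {}
    for name, (a, b) in PLANES.items():
        occupancy = [[0.0] * res for _ in range(res)]
        index = [[0.0] * res for _ in range(res)]
        # Bones: mark every sampled point along each segment as occupied.
        for start, end in bones:
            p, q = joints[start], joints[end]
            for s in range(samples + 1):
                t = s / samples
                u = p[a] + t * (q[a] - p[a])
                v = p[b] + t * (q[b] - p[b])
                occupancy[to_pixel(v, res)][to_pixel(u, res)] = 1.0
        # Joints: occupied, and tagged with a normalized index in (0, 1].
        # Joints that project to the same pixel overwrite each other.
        for i, joint in enumerate(joints):
            row, col = to_pixel(joint[b], res), to_pixel(joint[a], res)
            occupancy[row][col] = 1.0
            index[row][col] = (i + 1) / len(joints)
        maps[name] = {"occupancy": occupancy, "index": index}
    return maps


# A three-joint "L" shape lying in the z = 0 plane.
joints = [(0.0, 0.0, 0.0), (0.0, 0.5, 0.0), (0.5, 0.5, 0.0)]
bones = [(0, 1), (1, 2)]
maps = encode_skeleton(joints, bones, res=8)
for row in maps["xy"]["occupancy"]:
    print("".join("#" if cell else "." for cell in row))
```

Since the resulting maps share the spatial layout of the three feature planes, they can in principle be stacked with triplane features as a conditioning signal, which is the role the caption suggests for them.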