
UNICA: A Unified Neural Framework for Controllable 3D Avatars

Jiahe Zhu, Xinyao Wang, Yiyu Zhuang, Yanwen Wang, Jing Tian, Yao Yao, Hao Zhu

Abstract

Controllable 3D human avatars have found widespread applications in 3D games, the metaverse, and AR/VR scenarios. The conventional approach to creating such a 3D avatar requires a lengthy, intricate pipeline encompassing appearance modeling, motion planning, rigging, and physical simulation. In this paper, we introduce UNICA (UNIfied neural Controllable Avatar), a skeleton-free generative model that unifies all avatar control components into a single neural framework. Given keyboard inputs akin to video game controls, UNICA generates the next frame of a 3D avatar's geometry through an action-conditioned diffusion model operating on 2D position maps. A point transformer then maps the resulting geometry to 3D Gaussian Splatting for high-fidelity free-view rendering. Our approach naturally captures hair and loose clothing dynamics without manually designed physical simulation, and supports extra-long autoregressive generation. To the best of our knowledge, UNICA is the first model to unify the workflow of "motion planning, rigging, physical simulation, and rendering". Code is released at https://github.com/zjh21/UNICA.
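The pipeline described above (keyboard action → action-conditioned diffusion over position-map latents → bilinear upscaling into a coarse 3DGS → point-transformer refinement) can be illustrated with a minimal sketch. Every function body below is a hypothetical stand-in for the paper's learned models, and the names (`encode_action`, `diffusion_next_position_map`, `point_transformer_refine`) are invented for illustration; only the data flow mirrors the description.

```python
import numpy as np

def encode_action(key: str) -> np.ndarray:
    """Hypothetical action embedding: one-hot over a small key set."""
    keys = ["w", "a", "s", "d"]
    emb = np.zeros(len(keys))
    emb[keys.index(key)] = 1.0
    return emb

def diffusion_next_position_map(context_maps, action_emb):
    """Stand-in for the action-conditioned diffusion model: given the
    latents of three previous position maps, predict the next one.
    Here we just extrapolate linearly and add a toy action offset."""
    prev, curr = context_maps[-2], context_maps[-1]
    drift = action_emb[:3] * 0.01  # toy mapping of action to displacement
    return curr + (curr - prev) + drift

def point_transformer_refine(coarse_gaussians):
    """Stand-in for the point transformer that maps the coarse 3DGS to
    one with avatar-specific appearance details (identity here)."""
    return coarse_gaussians

# Three context frames of a 16x16 position map, channels = xyz coordinates.
context = [np.full((16, 16, 3), t * 0.01) for t in range(3)]
next_map = diffusion_next_position_map(context, encode_action("w"))

# Upscale the generated position map (nearest-neighbor standing in for
# bilinear interpolation) to form the means of a coarse 3DGS.
coarse = np.repeat(np.repeat(next_map, 2, axis=0), 2, axis=1)
refined = point_transformer_refine(coarse)
print(refined.shape)  # (32, 32, 3)
```

The point of the sketch is the single-model data flow: geometry generation and appearance mapping are sequential stages of one forward pass, with no rigging or simulation step in between.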

Paper Structure

This paper contains 62 sections, 13 equations, 21 figures, and 7 tables.

Figures (21)

  • Figure 1: UNICA is a unified model that generates action-controlled, $360^\circ$-renderable 3D avatars with dynamics. For the first time, UNICA unifies a workflow of "motion planning, rigging, physical simulation, and rendering" within a single model.
  • Figure 2: The pipeline of UNICA. UNICA consists of an action-conditioned multi-frame diffusion model for avatar geometry and a point transformer for point-to-3DGS appearance mapping. The diffusion model takes latents of three position maps as context and generates one subsequent position map conditioned on a chosen action embedding. The generated position map is upscaled using bilinear interpolation and combined with other base 3DGS attributes to form a coarse 3DGS. The point transformer then refines the coarse 3DGS into one with avatar-specific appearance details.
  • Figure 3: The position map rendering process and visualization of a four-frame group. (a) We use an A-Pose mesh of the avatar as geometry and the vertex coordinates of the posed avatar as vertex colors to render position maps. The position maps are rendered from six orthogonal views. (b) We partition the motion sequence into groups of four frames and normalize each group.
  • Figure 4: Demonstration of progressive 4D inference. During autoregressive inference of UNICA, each round generates a relative movement that is accumulated in 3D space for the actual movement of the 3D avatar. The output frame of round $n$ will be renormalized before it is used as input for round $n+1$.
  • Figure 5: Animation results of UNICA demonstrating avatar response to key presses. For visualization clarity, we sample one frame every three frames along the trajectory.
  • ...and 16 more figures
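The four-frame grouping and progressive 4D inference described in Figures 3 and 4 (each round generates a relative movement that is accumulated in 3D, and the output frame of round $n$ is renormalized before serving as input to round $n+1$) can be sketched with a toy autoregressive loop. `normalize_group`, `generate_round`, and the fixed step size are all hypothetical stand-ins; the paper's actual normalization scheme and generator may differ.

```python
import numpy as np

def normalize_group(frames):
    """Hypothetical per-group normalization: center the group on the
    first frame's mean position. Returns the normalized frames and the
    removed offset, which is accumulated as world-space movement."""
    offset = frames[0].mean(axis=0)
    return [f - offset for f in frames], offset

def generate_round(context_frames):
    """Stand-in generator: the next frame is the last context frame
    shifted by a small, fixed relative movement."""
    step = np.array([0.05, 0.0, 0.0])
    return context_frames[-1] + step

# A toy avatar: 8 points at the origin; three context frames seed round 0.
frame = np.zeros((8, 3))
context = [frame.copy() for _ in range(3)]
global_offset = np.zeros(3)

for round_idx in range(4):
    new_frame = generate_round(context)
    # Renormalize the four-frame group; the stripped offset is the
    # relative movement, accumulated in world space.
    group, offset = normalize_group(context + [new_frame])
    global_offset += offset
    # The renormalized frames become the context for the next round.
    context = group[1:]

# World-space position of the avatar after four rounds of 0.05 steps.
world_position = context[-1].mean(axis=0) + global_offset
print(world_position)  # [0.2 0.  0. ]
```

Renormalizing between rounds keeps each group in a bounded coordinate range for the generator, while the accumulated offsets preserve the avatar's actual trajectory through the scene.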