Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion

Zhenglin Zhou, Fan Ma, Hehe Fan, Tat-Seng Chua

TL;DR

This work tackles data-efficient generation of 4D animatable head avatars from a single image by distilling from a pre-trained video diffusion model. It introduces SymGEN, an updatable dataset loop that co-evolves with the avatar to mitigate spatial and temporal inconsistencies, and a Progressive Learning strategy that decouples learning into Spatial Consistency Learning and Temporal Consistency Learning to move from simple to complex scenarios. The approach combines an animatable Gaussian head with 3D Gaussian Splatting and portrait video diffusion, achieving higher fidelity, better animation quality, and faster rendering compared with SDS-based baselines, while keeping data requirements minimal. The method demonstrates strong empirical gains, including improved CLIP alignment, robust performance across poses and expressions, and practical real-time rendering, signaling a data-efficient path to lifelike 4D avatars for AR/VR and entertainment applications.

Abstract

Animatable head avatar generation typically requires extensive data for training. To reduce the data requirements, a natural solution is to leverage existing data-free static avatar generation methods, such as pre-trained diffusion models with score distillation sampling (SDS), which align avatars with pseudo ground-truth outputs from the diffusion model. However, directly distilling 4D avatars from video diffusion often leads to over-smooth results due to spatial and temporal inconsistencies in the generated video. To address this issue, we propose Zero-1-to-A, a robust method that synthesizes a spatially and temporally consistent dataset for 4D avatar reconstruction using the video diffusion model. Specifically, Zero-1-to-A iteratively constructs video datasets and optimizes animatable avatars in a progressive manner, ensuring that avatar quality increases smoothly and consistently throughout the learning process. This progressive learning involves two stages: (1) Spatial Consistency Learning fixes expressions and learns from front-to-side views, and (2) Temporal Consistency Learning fixes views and learns from relaxed to exaggerated expressions, generating 4D avatars in a simple-to-complex manner. Extensive experiments demonstrate that Zero-1-to-A improves fidelity, animation quality, and rendering speed compared to existing diffusion-based methods, providing a solution for lifelike avatar creation. Code is publicly available at: https://github.com/ZhenglinZhou/Zero-1-to-A.
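
The two-stage, simple-to-complex ordering described in the abstract can be pictured as a small curriculum generator. The sketch below is only an illustration of that ordering, not the authors' implementation: the names `CurriculumStep` and `progressive_schedule`, the yaw range, the step counts, and the scalar "expression intensity" are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class CurriculumStep:
    stage: str                   # "spatial" or "temporal"
    yaw_deg: float               # camera yaw; 0.0 means frontal (assumed convention)
    expression_intensity: float  # 0.0 = neutral, 1.0 = most exaggerated (assumed scale)


def progressive_schedule(
    max_yaw_deg: float = 60.0,   # hypothetical side-view limit
    n_view_steps: int = 4,       # hypothetical number of view steps
    n_expr_steps: int = 4,       # hypothetical number of expression steps
) -> Iterator[CurriculumStep]:
    """Yield training conditions from simple to complex."""
    if n_view_steps < 2 or n_expr_steps < 1:
        raise ValueError("need at least 2 view steps and 1 expression step")
    # Stage 1 (Spatial Consistency Learning): expression fixed,
    # camera moves from the frontal view toward the side view.
    for i in range(n_view_steps):
        yield CurriculumStep("spatial", max_yaw_deg * i / (n_view_steps - 1), 0.0)
    # Stage 2 (Temporal Consistency Learning): camera fixed,
    # expressions grow from relaxed to exaggerated.
    for i in range(1, n_expr_steps + 1):
        yield CurriculumStep("temporal", 0.0, i / n_expr_steps)
```

Iterating over `progressive_schedule()` yields all spatial steps before any temporal step, which is the property the abstract emphasizes; how the actual method parameterizes views and expressions is not specified in this summary.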

Paper Structure

This paper contains 21 sections, 2 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: 4D Avatar generation with the SDS loss of DreamFusion (Poole et al., 2022) and our Zero-1-to-A. Video diffusion often suffers from spatial and temporal inconsistencies. (a) The SDS loss, aligning avatar with output from video diffusion, produces over-smooth results. (b) Zero-1-to-A addresses this issue by synthesizing spatially and temporally consistent datasets for avatar reconstruction. It introduces an updatable dataset to cache video diffusion results and establishes a mutually beneficial relationship between avatar generation and dataset construction to further enhance consistency.
  • Figure 2: Image to animatable avatars generation by Zero-1-to-A. Without manually annotated data, our method can distill high-fidelity 4D avatars with real-time rendering speed from a pre-trained video diffusion using only one image input.
  • Figure 3: Framework of SymGEN. Our method simultaneously builds both the dataset and avatar from scratch through video diffusion. It establishes a mutually beneficial relationship between dataset construction and avatar reconstruction, iteratively updating the synthesized dataset and training the head avatar on the updated dataset to achieve unified results.
  • Figure 4: Pipeline of Progressive Learning. It sequences learning from simple to complex, facilitating symbiotic generation to create consistent avatars from inconsistent video diffusion. This process divides 4D avatar generation into: (1) Spatial Consistency Learning: progressing from frontal to side views with a fixed expression. (2) Temporal Consistency Learning: progressing from relaxed to exaggerated expressions under a fixed camera.
  • Figure 5: Comparison with Static Head Avatar Generation Methods. From top to bottom: Doctor Strange, Two-Face from DC, Vincent van Gogh, and Black Panther from Marvel. Guided by image prompts, our approach captures rich details and demonstrates superior performance in texture and geometry.
  • ...and 12 more figures
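
The alternating loop that the captions of Figures 1 and 3 describe (cache video diffusion results in an updatable dataset, train the avatar on it, then regenerate the dataset from the improved avatar) can be sketched generically. This is a minimal sketch under stated assumptions, not the SymGEN implementation: `symbiotic_generation` and its `render`, `refine`, and `fit` callables are hypothetical placeholders for the avatar renderer, the video diffusion model, and the reconstruction step.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple, TypeVar

Avatar = TypeVar("Avatar")
Frame = TypeVar("Frame")


def symbiotic_generation(
    avatar: Avatar,
    keys: Sequence[Hashable],                            # e.g. (view, expression) conditions
    render: Callable[[Avatar, Hashable], Frame],         # hypothetical: render the current avatar
    refine: Callable[[Frame], Frame],                    # hypothetical: video diffusion pseudo ground truth
    fit: Callable[[Avatar, Dict[Hashable, Frame]], Avatar],  # hypothetical: reconstruction step
    rounds: int,
) -> Tuple[Avatar, Dict[Hashable, Frame]]:
    """Alternate between updating a cached dataset and training the avatar on it."""
    dataset: Dict[Hashable, Frame] = {}
    for _ in range(rounds):
        # Dataset update: overwrite cached entries with refined renders
        # of the *current* avatar, so the cache co-evolves with it.
        for key in keys:
            dataset[key] = refine(render(avatar, key))
        # Avatar update: fit to the fixed cached targets rather than
        # to a fresh diffusion sample at every optimization step.
        avatar = fit(avatar, dataset)
    return avatar, dataset
```

With toy stand-ins (a float as the "avatar", `refine` moving a value halfway toward 1.0, and `fit` averaging the cache), three rounds move the avatar from 0.0 to 0.875, which illustrates how the dataset and the avatar improve each other over successive rounds.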