Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion
Zhenglin Zhou, Fan Ma, Hehe Fan, Tat-Seng Chua
TL;DR
This work tackles data-efficient generation of 4D animatable head avatars from a single image by distilling from a pre-trained video diffusion model. It introduces SymGEN, an updatable dataset loop that co-evolves with the avatar to mitigate spatial and temporal inconsistencies, and a Progressive Learning strategy that splits training into Spatial Consistency Learning and Temporal Consistency Learning so that optimization moves from simple to complex scenarios. The approach pairs an animatable Gaussian head, built on 3D Gaussian Splatting, with portrait video diffusion, and reports higher fidelity, better animation quality, and faster rendering than SDS-based baselines while keeping data requirements minimal. Reported results include improved CLIP alignment, robust performance across poses and expressions, and real-time rendering, pointing to a data-efficient path to lifelike 4D avatars for AR/VR and entertainment applications.
Abstract
Animatable head avatar generation typically requires extensive data for training. To reduce the data requirements, a natural solution is to leverage existing data-free static avatar generation methods, such as pre-trained diffusion models with score distillation sampling (SDS), which align avatars with pseudo ground-truth outputs from the diffusion model. However, directly distilling 4D avatars from video diffusion often leads to over-smoothed results due to spatial and temporal inconsistencies in the generated video. To address this issue, we propose Zero-1-to-A, a robust method that uses the video diffusion model to synthesize a spatially and temporally consistent dataset for 4D avatar reconstruction. Specifically, Zero-1-to-A iteratively constructs video datasets and optimizes animatable avatars in a progressive manner, ensuring that avatar quality increases smoothly and consistently throughout the learning process. This progressive learning involves two stages: (1) Spatial Consistency Learning fixes expressions and learns from frontal to side views, and (2) Temporal Consistency Learning fixes views and learns from relaxed to exaggerated expressions, generating 4D avatars in a simple-to-complex manner. Extensive experiments demonstrate that Zero-1-to-A improves fidelity, animation quality, and rendering speed compared to existing diffusion-based methods, providing a solution for lifelike avatar creation. Code is publicly available at: https://github.com/ZhenglinZhou/Zero-1-to-A.
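To make the two-stage, simple-to-complex ordering concrete, the sketch below builds a toy curriculum in Python. It is an illustration only, not the released implementation: the names `Sample`, `linspace`, and `progressive_schedule`, the yaw range, the normalized expression intensity, and the choice of a frontal view in the second stage are all assumptions made for this example. It covers only the ordering of training conditions; the video diffusion model, the dataset updates, and the avatar optimization are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    """One training condition in the (hypothetical) curriculum."""
    stage: str         # "spatial" or "temporal"
    yaw_deg: float     # camera yaw in degrees; 0 is the frontal view
    expression: float  # expression intensity in [0, 1]; 0 is relaxed


def linspace(start: float, stop: float, n: int) -> List[float]:
    """Return n evenly spaced values from start to stop, inclusive."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return [start]
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def progressive_schedule(
    max_yaw_deg: float = 90.0,
    n_views: int = 4,
    n_expressions: int = 4,
) -> List[Sample]:
    """Order training conditions from simple to complex.

    Stage 1 (spatial): expression held fixed, views move front to side.
    Stage 2 (temporal): view held fixed, expressions move from relaxed
    to exaggerated. Fixing the view to frontal is an assumption here.
    """
    schedule: List[Sample] = []
    for yaw in linspace(0.0, max_yaw_deg, n_views):
        schedule.append(Sample("spatial", yaw, 0.0))
    for intensity in linspace(0.0, 1.0, n_expressions):
        schedule.append(Sample("temporal", 0.0, intensity))
    return schedule


if __name__ == "__main__":
    for sample in progressive_schedule():
        print(sample)
```

In a full pipeline, each entry of such a schedule would presumably drive one round of pseudo ground-truth video generation and avatar optimization, with easier conditions visited before harder ones.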
