ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation
Fan Yang, Heyuan Li, Peihao Li, Weihao Yuan, Lingteng Qiu, Chaoyue Song, Cheng Chen, Yisheng He, Shifeng Zhang, Xiaoguang Han, Steven Hoi, Guosheng Lin
TL;DR
ViSA addresses the challenge of real-time, photorealistic upper-body avatar generation from a single image by integrating explicit 3D reconstruction with a real-time autoregressive video shader conditioned on 3D-aware features. A reconstruction module provides stable identity, pose, and texture priors, while a distilled video diffusion renderer delivers high-frequency details and fluid motion; an adversarial distribution preservation loss and progressive training ensure sharp, realistic outputs. The approach achieves superior visual quality and temporal coherence at real-time speeds, outperforming state-of-the-art one-shot and video-based avatar methods both quantitatively and qualitatively. This work enables practical applications in gaming and virtual reality by delivering fast, reliable, and identity-preserving avatars from a single reference image.
Abstract
Generating high-fidelity upper-body 3D avatars from a single input image remains a significant challenge. Current 3D avatar generation methods, which rely on large reconstruction models, are fast and capable of producing stable body structures, but they often suffer from artifacts such as blurry textures and stiff, unnatural motion. In contrast, generative video models show promising performance by synthesizing photorealistic and dynamic results, but they frequently struggle with unstable behavior, including body structural errors and identity drift. To address these limitations, we propose a novel approach that combines the strengths of both paradigms. Our framework employs a 3D reconstruction model to provide robust structural and appearance priors, which in turn guide a real-time autoregressive video diffusion model for rendering. This process enables the model to synthesize high-frequency, photorealistic details and fluid motion in real time, effectively reducing texture blur and motion stiffness while preventing the structural inconsistencies common in video generation methods. By uniting the geometric stability of 3D reconstruction with the generative capabilities of video models, our method produces high-fidelity digital avatars with realistic appearance and dynamic, temporally coherent motion. Experiments demonstrate that our approach significantly reduces artifacts and achieves substantial improvements in visual quality over leading methods, providing a robust and efficient solution for real-time applications such as gaming and virtual reality. Project page: https://lhyfst.github.io/visa
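
The control flow described above (a 3D branch supplies a stable coarse frame, and an autoregressive shader refines it using its own previous outputs) can be illustrated with a toy sketch. This is not the ViSA implementation: `render_coarse`, `shade`, `generate`, the blending weights, and the window size are all hypothetical placeholders, and plain lists of floats stand in for images and learned features.

```python
from collections import deque


def render_coarse(prior, pose):
    """Hypothetical stand-in for the 3D reconstruction branch.

    Returns a stable but low-detail "frame" for the given pose.
    """
    return [prior * p for p in pose]


def shade(coarse, history):
    """Hypothetical stand-in for the video shader.

    Blends the coarse frame with the mean of previously generated frames.
    The 0.8 / 0.2 weights are arbitrary illustration values.
    """
    if not history:
        return list(coarse)
    context = [sum(vals) / len(history) for vals in zip(*history)]
    return [0.8 * c + 0.2 * h for c, h in zip(coarse, context)]


def generate(prior, poses, window=2):
    """Autoregressive loop: each output frame is fed back as context."""
    history = deque(maxlen=window)  # sliding window of past outputs
    frames = []
    for pose in poses:
        coarse = render_coarse(prior, pose)  # structure from the 3D prior
        frame = shade(coarse, history)       # refinement given past frames
        history.append(frame)
        frames.append(frame)
    return frames


frames = generate(1.0, [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]])
```

The point of the sketch is only the data dependency: every frame is anchored to the 3D prior through `render_coarse`, so errors in earlier outputs cannot accumulate into structural drift, while the bounded `history` window keeps the per-frame cost constant, which is what a real-time setting requires.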
