ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation

Fan Yang, Heyuan Li, Peihao Li, Weihao Yuan, Lingteng Qiu, Chaoyue Song, Cheng Chen, Yisheng He, Shifeng Zhang, Xiaoguang Han, Steven Hoi, Guosheng Lin

TL;DR

ViSA addresses the challenge of real-time photorealistic upper-body avatar generation from a single image by integrating explicit 3D reconstruction with a real-time autoregressive video shader conditioned on 3D-aware features. A reconstruction module provides stable identity, pose, and texture priors, while a distilled video diffusion renderer delivers high-frequency details and fluid motion; an adversarial distribution preservation loss and progressive training keep the outputs sharp and realistic. The approach achieves superior visual quality and temporal coherence at real-time speeds, outperforming state-of-the-art one-shot and video-based avatars in both metrics and qualitative results. This work enables practical applications in gaming and virtual reality by delivering fast, reliable, and identity-preserving avatars from a single reference image.

Abstract

Generating high-fidelity upper-body 3D avatars from a single input image remains a significant challenge. Current 3D avatar generation methods, which rely on large reconstruction models, are fast and capable of producing stable body structures, but they often suffer from artifacts such as blurry textures and stiff, unnatural motion. In contrast, generative video models show promising performance by synthesizing photorealistic and dynamic results, but they frequently struggle with unstable behavior, including body structural errors and identity drift. To address these limitations, we propose a novel approach that combines the strengths of both paradigms. Our framework employs a 3D reconstruction model to provide robust structural and appearance priors, which in turn guide a real-time autoregressive video diffusion model for rendering. This process enables the model to synthesize high-frequency, photorealistic details and fluid dynamics in real time, effectively reducing texture blur and motion stiffness while preventing the structural inconsistencies common in video generation methods. By uniting the geometric stability of 3D reconstruction with the generative capabilities of video models, our method produces high-fidelity digital avatars with realistic appearance and dynamic, temporally coherent motion. Experiments demonstrate that our approach significantly reduces artifacts and achieves substantial improvements in visual quality over leading methods, providing a robust and efficient solution for real-time applications such as gaming and virtual reality. Project page: https://lhyfst.github.io/visa

Paper Structure

This paper contains 19 sections, 10 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Photorealistic, consistent, and controllable character animation from a single reference image. Our method enables photorealistic upper-body avatar generation that preserves appearance fidelity across diverse poses and expressions while maintaining temporal coherence in real-time video synthesis. Input images are synthesized using Gemini [team2023gemini].
  • Figure 2: Overview of the proposed ViSA. In the first stage, we train a one-shot, feed-forward transformer to regress a 3D Gaussian avatar in canonical space, conditioned on geometric, semantic, and low-level embeddings. In the second stage, we employ an autoregressive video model as a video renderer, conditioned on the 3D-aware features from stage one, to generate photorealistic results in real time.
  • Figure 3: The architecture of our discriminator. We adopt a pretrained WAN video diffusion model as the backbone and add a trainable classification head on top of it to predict a logit. We also incorporate features from the reference image to enhance the training of conditional generation.
  • Figure 4: The effects of our neck-removal strategy. This simple yet effective strategy eliminates neck artifacts, yielding a coherent, natural-looking full-body template for training. The red boxes highlight the visual differences.
  • Figure 5: Qualitative comparisons on self-reenactment. Compared with existing approaches, our method generates more accurate and clearer results.
  • ...and 3 more figures
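
The two-stage flow described in the Figure 2 caption can be pictured with a toy sketch. This is only an illustration of the data flow under loud assumptions, not the authors' implementation: every name below (`CanonicalAvatar`, `reconstruct_avatar`, `render_3d_features`, `shade_frame`, `animate`, and the `context` parameter) is hypothetical, frames are plain lists of floats, and the "shader" is a simple average standing in for the autoregressive video diffusion renderer.

```python
from dataclasses import dataclass
from typing import List, Sequence

Frame = List[float]  # toy stand-in for an image or feature map


@dataclass
class CanonicalAvatar:
    """Hypothetical stand-in for the stage-one 3D Gaussian avatar."""
    identity: Frame


def reconstruct_avatar(reference: Frame) -> CanonicalAvatar:
    # Stage 1 (placeholder): a feed-forward model would regress the avatar here.
    return CanonicalAvatar(identity=list(reference))


def render_3d_features(avatar: CanonicalAvatar, pose: float) -> Frame:
    # Placeholder "3D-aware feature": the identity shifted by the driving pose.
    return [v + pose for v in avatar.identity]


def shade_frame(features: Frame, history: Sequence[Frame]) -> Frame:
    # Stage 2 (placeholder): average the current features with the most recent
    # output, mimicking conditioning on previously generated frames.
    if not history:
        return list(features)
    prev = history[-1]
    return [0.5 * f + 0.5 * p for f, p in zip(features, prev)]


def animate(reference: Frame, poses: Sequence[float], context: int = 2) -> List[Frame]:
    avatar = reconstruct_avatar(reference)  # runs once per identity
    outputs: List[Frame] = []
    for pose in poses:  # runs once per output frame
        features = render_3d_features(avatar, pose)
        # Keep at most `context` past frames as the autoregressive history.
        history = outputs[max(0, len(outputs) - context):]
        outputs.append(shade_frame(features, history))
    return outputs


if __name__ == "__main__":
    print(animate([1.0, 2.0], [0.0, 2.0]))
```

The point of the sketch is the split in cost: the reconstruction step runs once per reference image, while the per-frame loop only consumes cheap 3D-aware features plus a short history of its own outputs, which is the structure that makes real-time generation plausible.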