CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention

Gaojie Lin, Jianwen Jiang, Chao Liang, Tianyun Zhong, Jiaqi Yang, Yanbo Zheng

TL;DR

This work tackles the challenge of one-stage, audio-driven talking body video generation by introducing CyberHost, a diffusion-based framework augmented with Region Attention Modules (RAM) and human-prior-guided conditions. RAM combines a spatio-temporal region latents bank with identity descriptors to enhance local details in hands and face, while body movement maps, hand clarity scores, and pose-aligned reference features stabilize motion and structure. The approach enables end-to-end, zero-shot generation from a single image and audio, and demonstrates superior performance over state-of-the-art methods across audio-driven and video-driven tasks, including open-set generalization. Extensive ablations validate the RAM design and priors, and the model supports multimodal control and open-set results, underscoring its practical potential and considerations for ethical deployment.

Abstract

Diffusion-based video generation technology has advanced significantly, catalyzing a proliferation of research in human animation. However, the majority of these studies are confined to same-modality driving settings, with cross-modality human body animation remaining relatively underexplored. In this paper, we introduce CyberHost, an end-to-end audio-driven human animation framework that ensures hand integrity, identity consistency, and natural motion. The key design of CyberHost is the Region Codebook Attention mechanism, which improves the generation quality of facial and hand animations by integrating fine-grained local features with learned motion pattern priors. Furthermore, we have developed a suite of human-prior-guided training strategies, including body movement map, hand clarity score, pose-aligned reference feature, and local enhancement supervision, to improve synthesis results. To our knowledge, CyberHost is the first end-to-end audio-driven human diffusion model capable of zero-shot video generation within the scope of the human body. Extensive experiments demonstrate that CyberHost surpasses previous works in both quantitative and qualitative aspects.
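As a rough illustration of the idea behind Region Codebook Attention, the sketch below lets tokens inside a local region (for example, a hand) attend over a learnable latents bank joined with identity descriptor tokens, and adds the result back only where a region mask is active. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `region_codebook_attention`, the single-head attention without learned query/key/value projections, the concatenation of bank and identity tokens, and the mask-gated residual are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def region_codebook_attention(features, region_mask, latents_bank, identity_tokens):
    """Hypothetical single-head region attention (illustrative only).

    features:        (N, D) flattened backbone tokens used as queries
    region_mask:     (N,)   values in [0, 1]; 1 marks tokens inside the region
    latents_bank:    (K, D) learnable bank standing in for shared motion priors
    identity_tokens: (M, D) descriptors taken from the reference image region
    """
    d = features.shape[-1]
    # Assumption: bank entries and identity descriptors form one key/value memory.
    memory = np.concatenate([latents_bank, identity_tokens], axis=0)  # (K + M, D)
    scores = features @ memory.T / np.sqrt(d)                         # (N, K + M)
    weights = softmax(scores, axis=-1)
    attended = weights @ memory                                       # (N, D)
    # Gate the residual update so only tokens inside the region change.
    return features + region_mask[:, None] * attended


rng = np.random.default_rng(0)
features = rng.normal(size=(16, 8))        # 16 tokens, 8 channels
latents_bank = rng.normal(size=(4, 8))     # 4 bank entries
identity_tokens = rng.normal(size=(2, 8))  # 2 identity descriptor tokens
region_mask = np.zeros(16)
region_mask[:5] = 1.0                      # pretend the first 5 tokens cover a hand

out = region_codebook_attention(features, region_mask, latents_bank, identity_tokens)
```

With this gating, tokens outside the masked region pass through unchanged, which mirrors the stated goal of enhancing local detail in hands and face without disturbing the rest of the frame.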

Paper Structure (32 sections, 5 equations, 14 figures, 8 tables, 1 algorithm)


Figures (14)

  • Figure 1: Existing body animation methods struggle to generate detailed hand and facial results in both video-driven (V2V) and audio-driven (A2V) settings. In contrast, our approach ensures hand integrity and facial identity consistency. These differences are also illustrated with videos in the supplementary materials.
  • Figure 2: The overall structure of CyberHost. We aim to generate a video clip by driving a reference image based on an audio signal. Region attention modules (RAMs) are inserted at multiple stages of the denoising U-Net for fine-grained modeling of local regions. Human-prior-guided conditions, including the body movement map, hand clarity score, and pose-aligned reference features, are also introduced to reduce motion uncertainty. The reference network extracts motion cues from motion frames for temporal continuation.
  • Figure 3: An illustration of region attention module (RAM), using the hand region as an example.
  • Figure 4: The audio-driven talking body results of CyberHost compared to other methods.
  • Figure 5: Comparisons with other video-driven body reenactment results.
  • ...and 9 more figures