Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts

Xiang Deng, Youxin Pang, Xiaochen Zhao, Chao Xu, Lizhen Wang, Hongjiang Xiao, Shi Yan, Hongwen Zhang, Yebin Liu

Abstract

This paper introduces Stereo-Talker, a novel one-shot audio-driven human video synthesis system that generates 3D talking videos with precise lip synchronization, expressive body gestures, temporally consistent photo-realistic quality, and continuous viewpoint control. The process follows a two-stage approach. In the first stage, the system maps audio input to high-fidelity motion sequences, encompassing upper-body gestures and facial expressions. To enrich motion diversity and authenticity, large language model (LLM) priors are integrated with text-aligned semantic audio features, leveraging LLMs' cross-modal generalization power to enhance motion quality. In the second stage, we improve diffusion-based video generation models by incorporating a prior-guided Mixture-of-Experts (MoE) mechanism: a view-guided MoE focuses on view-specific attributes, while a mask-guided MoE enhances region-based rendering stability. Additionally, a mask prediction module is devised to derive human masks from motion data, enhancing the stability and accuracy of masks and enabling mask guidance during inference. We also introduce a comprehensive human video dataset with 2,203 identities, covering diverse body gestures and detailed annotations, facilitating broad generalization. The code, data, and pre-trained models will be released for research purposes.
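
To make the idea of prior-guided expert routing more concrete, the following is a minimal sketch of how a mask could act as the gate in a Mixture-of-Experts layer. It is an illustration only, not the authors' implementation: the function name `mask_guided_moe`, the use of per-region linear experts, and the tensor shapes are all assumptions introduced here. The point it shows is that the gate weights come from a prior (region masks) rather than from a learned router.

```python
import numpy as np

def mask_guided_moe(features, region_masks, expert_weights):
    """Blend per-region experts using masks as gates (hypothetical sketch).

    features:       (H, W, C) feature map
    region_masks:   (K, H, W) non-negative soft masks, one per expert
    expert_weights: (K, C, C) one linear map per expert (an assumption;
                    real experts would typically be small networks)
    """
    # Normalize the masks so the gate weights at each pixel sum to 1.
    total = np.clip(region_masks.sum(axis=0, keepdims=True), 1e-8, None)
    gates = region_masks / total
    # Apply every expert to every pixel: result has shape (K, H, W, C).
    expert_out = np.einsum("hwc,kcd->khwd", features, expert_weights)
    # Combine expert outputs per pixel, weighted by the mask-derived gates.
    return np.einsum("khw,khwd->hwd", gates, expert_out)
```

With hard (0/1) masks, each pixel is processed by exactly one expert; with soft masks, outputs are blended near region boundaries. A view-guided variant could be sketched the same way by replacing the spatial masks with weights derived from the camera viewpoint, though the source does not specify how either gate is computed.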

Paper Structure

This paper contains 19 sections, 5 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: We present a framework designed for the synthesis of human videos driven by audio inputs. Given a single reference image in the first column and an arbitrary audio clip, our methodology produces high-fidelity photo-realistic video outputs depicting the subject engaged in realistic conversation. The synthesized frames illustrate our achievement of accurate lip synchronization, spontaneous eye blinking, and vivid body gestures, collectively pushing the boundaries of audio-driven human synthesis to new heights.
  • Figure 2: The overall framework of Stereo-Talker. Given a single portrait image and its driving audio, we first convert the audio input to human motion sequences based on large language model priors. Then, we render these motions to high-fidelity human videos through a U-Net backbone, where a view Mixture-of-Experts (MoE) module and a mask MoE module improve the rendering stability. Notably, we train a mask generation network to predict the human mask at inference time.
  • Figure 3: Our method is capable of synthesizing vivid speaking videos with high temporal stability and view consistency.
  • Figure 4: More visualization results of view consistency.
  • Figure 5: Visual comparisons with the one-shot talking human synthesis method Vlogger [corona2024vlogger]. Our generated outputs exhibit more diverse body motion, which makes the results more expressive.
  • ...and 3 more figures