ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model

Jinwei Qi, Chaonan Ji, Sheng Xu, Peng Zhang, Bang Zhang, Liefeng Bo

TL;DR

This work addresses the challenge of real-time, stylized portrait video generation that faithfully synchronizes expressive facial motions with natural upper-body dynamics. It introduces a two-stage framework: (i) a hierarchical audio-to-motion diffusion model that produces expressive facial and upper-body controls with style transfer, and (ii) a hybrid control fusion video generator that uses explicit landmarks, implicit offsets, hand renderings via MANO, and a face refinement module to deliver high-fidelity, lifelike portrait videos. Key contributions include an efficient hierarchical diffusion process for audio-driven motion, a hybrid explicit-implicit motion representation, explicit hand control integration, and a real-time, extensible video synthesis pipeline that runs at up to 30fps at 512×768 on a 4090 GPU. The approach demonstrates superior quality in facial expressiveness and natural upper-body movements, enabling practical applications in video-chat, virtual avatars, and AR/VR contexts where realistic, controllable digital humans are valuable.

Abstract

Real-time interactive video-chat portraits have been increasingly recognized as the future trend, particularly due to the remarkable progress made in text and voice chat technologies. However, existing methods primarily focus on real-time generation of head movements and struggle to produce synchronized body motions that match these head actions. Additionally, achieving fine-grained control over the speaking style and nuances of facial expressions remains a challenge. To address these limitations, we introduce a novel framework for stylized real-time portrait video generation, enabling expressive and flexible video chat that extends from talking head to upper-body interaction. Our approach consists of two stages. The first stage uses efficient hierarchical motion diffusion models that take both explicit and implicit motion representations into account based on audio inputs, and that can generate a diverse range of facial expressions with stylistic control and synchronization between head and body movements. The second stage generates portrait video featuring upper-body movements, including hand gestures. We inject explicit hand control signals into the generator to produce more detailed hand movements, and further perform face refinement to enhance the overall realism and expressiveness of the portrait video. Additionally, our approach supports efficient and continuous generation of upper-body portrait video at a maximum resolution of 512 × 768 and up to 30 fps on a 4090 GPU, supporting interactive video-chat in real time. Experimental results demonstrate the capability of our approach to produce portrait videos with rich expressiveness and natural upper-body movements.
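
To make the real-time claim concrete, the following back-of-the-envelope sketch works out what 30 fps at 512 × 768 implies for the per-frame time budget and output pixel throughput. Only the frame rate and resolution come from the abstract. The function names are illustrative, and the calculation says nothing about how the authors' pipeline actually spends that budget.

```python
# Rough real-time budget. Only the 30 fps and 512x768 figures are taken
# from the abstract; the helper names are hypothetical.

def frame_budget_ms(fps: float) -> float:
    """Maximum time available to produce one frame at the given rate."""
    return 1000.0 / fps


def pixels_per_second(width: int, height: int, fps: float) -> float:
    """Number of output pixels that must be synthesized per second."""
    return width * height * fps


budget = frame_budget_ms(30)
throughput = pixels_per_second(512, 768, 30)
print(f"{budget:.1f} ms per frame")                      # 33.3 ms per frame
print(f"{throughput / 1e6:.1f} megapixels per second")   # 11.8 megapixels per second
```

In other words, audio-to-motion prediction, warping, hand injection, and face refinement together would need to finish in roughly 33 ms per frame on average to sustain the stated rate.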

Paper Structure

This paper contains 15 sections, 6 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustration of real-time portrait video generation. Given a portrait image and audio sequence as input, our model can generate high-fidelity animation results from full head to upper-body interaction with diverse facial expressions and style control.
  • Figure 2: Pipeline of upper-body video generation with hybrid control fusion, which uses both explicit facial keypoints and implicit body keypoints to conduct feature warping, while a rendered hand image is further injected into the generator to improve the quality of hand generation.
  • Figure 3: Illustration of the hierarchical audio2motion diffusion model, including facial motion prediction with style control at the bottom and upper-body motion prediction with hands at the top.
  • Figure 4: Illustration of the face refinement network. The left side of the figure shows the architecture, while the right side demonstrates that facial keypoints are located more precisely by adding an implicit offset.
  • Figure 5: Qualitative comparisons of upper-body video generation under the self-driven reenactment setup. Our approach significantly outperforms the GAN-based comparison methods and achieves quality comparable to that of the diffusion-based method EchoMimicV2.
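
The Figure 4 caption describes refining facial keypoints by adding an implicit offset. A minimal sketch of that idea, assuming 2D keypoints and a per-keypoint additive offset, is given below. The function name, the coordinate convention, and the sample values are hypothetical and are not taken from the paper, which may define the offset in a different space.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def apply_offsets(explicit: List[Point], offsets: List[Point]) -> List[Point]:
    """Shift each explicit keypoint by its corresponding offset.

    Assumption: one (dx, dy) offset per keypoint, in the same pixel space.
    """
    if len(explicit) != len(offsets):
        raise ValueError("expected one offset per keypoint")
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(explicit, offsets)]


# Hypothetical values: two detected landmarks and two predicted offsets.
landmarks = [(100.0, 120.0), (140.0, 118.0)]
offsets = [(1.5, -0.5), (-2.0, 0.25)]

refined = apply_offsets(landmarks, offsets)
print(refined)  # [(101.5, 119.5), (138.0, 118.25)]
```

The point of the sketch is only that the explicit landmarks provide a coarse location while the offset supplies a small correction on top of it.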