ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model
Jinwei Qi, Chaonan Ji, Sheng Xu, Peng Zhang, Bang Zhang, Liefeng Bo
TL;DR
This work addresses real-time, stylized portrait video generation in which expressive facial motion stays synchronized with natural upper-body dynamics. It introduces a two-stage framework: (i) a hierarchical audio-to-motion diffusion model that produces facial and upper-body control signals with style transfer, and (ii) a hybrid control fusion video generator that combines explicit landmarks, implicit offsets, hand renderings via MANO, and a face refinement module to produce high-fidelity portrait videos. Key contributions include an efficient hierarchical diffusion process for audio-driven motion, a hybrid explicit-implicit motion representation, explicit hand control integration, and a real-time, extensible video synthesis pipeline that runs at up to 30 fps at 512×768 on an RTX 4090 GPU. Experiments show rich facial expressiveness and natural upper-body movement, supporting applications such as video chat, virtual avatars, and AR/VR, where realistic, controllable digital humans are valuable.
Abstract
Real-time interactive video-chat portraits are increasingly recognized as a future trend, particularly given the remarkable progress in text and voice chat technologies. However, existing methods focus primarily on real-time generation of head movements and struggle to produce body motions that are synchronized with those head actions. Fine-grained control over speaking style and the nuances of facial expression also remains a challenge. To address these limitations, we introduce a novel framework for stylized real-time portrait video generation, enabling expressive and flexible video chat that extends from talking head to upper-body interaction. Our approach consists of two stages. The first stage uses efficient hierarchical motion diffusion models that take both explicit and implicit motion representations into account, conditioned on audio input, to generate a diverse range of facial expressions with stylistic control and synchronized head and body movements. The second stage generates portrait video featuring upper-body movements, including hand gestures. We inject explicit hand control signals into the generator to produce more detailed hand movements, and further perform face refinement to enhance the overall realism and expressiveness of the portrait video. Our approach also supports efficient, continuous generation of upper-body portrait video at a maximum resolution of 512 × 768 and up to 30 fps on a 4090 GPU, enabling real-time interactive video chat. Experimental results demonstrate that our approach produces portrait videos with rich expressiveness and natural upper-body movements.
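As a rough illustration of what the stated real-time figures imply, the short Python sketch below derives the per-frame time budget and pixel throughput from the resolution and frame rate quoted in the abstract. The function names are hypothetical and do not come from the paper, and the sketch makes no assumption about how the two stages share the budget.

```python
# Back-of-envelope budget implied by the figures quoted in the abstract.
# Function names are illustrative only; they are not from the paper.

WIDTH, HEIGHT = 512, 768  # maximum resolution stated in the abstract
FPS = 30                  # maximum frame rate stated in the abstract


def frame_budget_ms(fps: int) -> float:
    """Total time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Number of output pixels that must be produced each second."""
    return width * height * fps


if __name__ == "__main__":
    print(f"Per-frame budget: {frame_budget_ms(FPS):.1f} ms")
    print(f"Pixel throughput: {pixels_per_second(WIDTH, HEIGHT, FPS):,} px/s")
```

At 30 fps, all work for one frame (audio-to-motion inference, video generation, and face refinement) has to fit in roughly 33 ms on average for the output to keep pace with playback.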
