Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization

Jiahao Cui, Yan Chen, Mingwang Xu, Hanlin Shang, Yuxuan Chen, Yun Zhan, Zilong Dong, Yao Yao, Jingdong Wang, Siyu Zhu

TL;DR

This work tackles the challenge of producing high-fidelity, audio-driven portrait animations that plausibly synchronize lip motion and complex facial expressions with fast body dynamics. It introduces direct preference optimization (DPO) trained on a curated human-preference dataset to align outputs with lip-sync accuracy and expressive naturalness, combined with a unified temporal motion modulation strategy that preserves high-frequency motion details in latent video representations. The approach is compatible with both UNet and DiT diffusion backbones and demonstrates state-of-the-art lip-sync and expression fidelity on benchmarks like HDTF and Celeb-V, with strong generalization to skeletal-driven motion. By releasing the code and a specialized preference dataset, the work provides a practical pathway toward more perceptually aligned, high-fidelity portrait synthesis in real-world applications.

Abstract

Generating highly dynamic and photorealistic portrait animations driven by audio and skeletal motion remains challenging due to the need for precise lip synchronization, natural facial expressions, and high-fidelity body motion dynamics. We propose a human-preference-aligned diffusion framework that addresses these challenges through two key innovations. First, we introduce direct preference optimization tailored for human-centric animation, leveraging a curated dataset of human preferences to align generated outputs with perceptual metrics for portrait motion-video alignment and naturalness of expression. Second, the proposed temporal motion modulation resolves spatiotemporal resolution mismatches by reshaping motion conditions into dimensionally aligned latent features through temporal channel redistribution and proportional feature expansion, preserving the fidelity of high-frequency motion details in diffusion-based synthesis. The proposed mechanism is complementary to existing UNet- and DiT-based portrait diffusion approaches, and experiments demonstrate clear improvements in lip-audio synchronization, expression vividness, and body motion coherence over baseline methods, alongside notable gains in human preference metrics. Our model and source code can be found at: https://github.com/fudan-generative-vision/hallo4.
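
To make the preference-optimization idea concrete, the following is a minimal sketch of a DPO-style loss for diffusion models, in the form commonly used for Diffusion-DPO. It is not taken from the Hallo4 code: this page does not state the paper's exact objective, so the function name `diffusion_dpo_loss`, its arguments, and the default `beta` are illustrative assumptions. Each input is a per-sample denoising error (for example, the mean squared error between true and predicted noise) for the preferred ("win") and rejected ("lose") clip, computed once with the model being trained and once with a frozen reference model.

```python
import numpy as np


def diffusion_dpo_loss(err_policy_win, err_policy_lose,
                       err_ref_win, err_ref_lose, beta=1.0):
    """DPO-style preference loss on per-sample denoising errors.

    All four inputs are arrays of shape (batch,). The name, signature and
    default beta are hypothetical; they are not from the Hallo4 repository.
    """
    # How much worse the trained model denoises the preferred clip
    # than the rejected one (negative means it favours the preferred clip).
    policy_margin = err_policy_win - err_policy_lose
    # The same quantity for the frozen reference model.
    ref_margin = err_ref_win - err_ref_lose
    # Positive logits mean the trained model favours the preferred clip
    # more strongly than the reference does.
    logits = -beta * (policy_margin - ref_margin)
    # -log(sigmoid(logits)) written in a numerically stable form.
    return float(np.mean(np.logaddexp(0.0, -logits)))


# Usage: identical models give log(2); favouring the preferred clip lowers it.
baseline = diffusion_dpo_loss(np.array([0.5]), np.array([0.5]),
                              np.array([0.5]), np.array([0.5]))
improved = diffusion_dpo_loss(np.array([0.2]), np.array([0.5]),
                              np.array([0.5]), np.array([0.5]))
```

The loss only rewards the trained model for improving on the reference model's relative preference, which is what keeps a DPO-trained model close to its starting point.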

Paper Structure

This paper contains 16 sections, 8 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Illustration of the proposed portrait animation framework. Given a reference portrait image and multimodal control signals (audio waveform with optional skeletal motion sequences), our method generates high-fidelity, dynamically coherent animations through two key innovations: (1) direct preference optimization for human-aligned synchronization and expressiveness, and (2) unified temporal motion modulation to preserve high-frequency body motion details. The framework achieves accurate lip-audio synchronization, natural facial expressions, and robust handling of rapid speech rhythms and abrupt upper-body motions across diverse character identities and environmental scenarios.
  • Figure 2: Demonstration of direct preference optimization for audio-driven portrait animation.
  • Figure 3: Demonstration of a DiT-based portrait generative pipeline with unified temporal motion modulation.
  • Figure 4: Qualitative comparison on the HDTF and Celeb-V datasets.
  • Figure 5: Qualitative comparison on the half-body EMTD dataset.
  • ...and 3 more figures
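
The temporal motion modulation named in the abstract and in Figure 3 is described only as "temporal channel redistribution and proportional feature expansion". The sketch below shows one plausible reading of the first part, under stated assumptions: the video latent is assumed to be temporally compressed by an integer factor `ratio`, and per-frame motion features are folded into the channel axis so that their time axis matches the latent's without discarding any frame. The function name, the tensor layout `(T, C, H, W)`, and the value of `ratio` are hypothetical and are not taken from the Hallo4 code. The "proportional feature expansion" step is not covered here.

```python
import numpy as np


def fold_time_into_channels(motion, ratio):
    """Reshape (T, C, H, W) motion features to (T // ratio, C * ratio, H, W).

    Each group of `ratio` consecutive frames is stacked along the channel
    axis, so no per-frame information is averaged away. Hypothetical helper,
    not part of the Hallo4 repository.
    """
    t, c, h, w = motion.shape
    if t % ratio != 0:
        raise ValueError("number of frames must be divisible by ratio")
    # Split the time axis into (groups, frames-per-group), then merge the
    # frames-per-group axis into the channel axis.
    grouped = motion.reshape(t // ratio, ratio, c, h, w)
    return grouped.reshape(t // ratio, ratio * c, h, w)


# Usage: 8 frames with 2 channels each, assumed temporal compression of 4.
motion = np.arange(8 * 2).reshape(8, 2, 1, 1)
folded = fold_time_into_channels(motion, ratio=4)
```

Compared with subsampling or averaging the motion sequence down to the latent frame rate, folding keeps every frame's features, which is consistent with the abstract's stated goal of preserving high-frequency motion detail.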