FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation

Weiting Tan, Andy T. Liu, Ming Tu, Xinghua Qu, Philipp Koehn, Lu Lu

TL;DR

FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality, and uses these assessments as rewards for reinforcement learning. Experiments show that the resulting model consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.

Abstract

Generating realistic talking-head videos remains challenging due to persistent issues such as imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception. We propose FlowPortrait, a reinforcement-learning framework for audio-driven portrait animation built on a multimodal backbone for autoregressive audio-to-video generation. FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality. These signals are combined with perceptual and temporal consistency regularizers to form a stable composite reward, which is used to post-train the generator via Group Relative Policy Optimization (GRPO). Extensive experiments, including both automatic evaluations and human preference studies, demonstrate that FlowPortrait consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.
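The reward construction described in the abstract can be illustrated with a minimal sketch. It assumes the composite reward is a weighted sum of the averaged MLLM scores and the two regularizer terms, and it uses the standard GRPO group-relative normalization (reward minus group mean, divided by group standard deviation). The function names, the weights, and the example numbers are hypothetical and are not taken from the paper; the authors' actual reward and their improved Flow-GRPO objective may differ.

```python
from statistics import mean, pstdev


def composite_reward(mllm_scores, perceptual, temporal,
                     w_mllm=1.0, w_perc=0.5, w_temp=0.5):
    """Combine MLLM scores with regularizer terms into one scalar.

    mllm_scores: per-aspect scores (e.g. lip-sync, expressiveness, motion).
    perceptual, temporal: regularizer values where higher means better.
    The weights are hypothetical placeholders, not values from the paper.
    """
    return w_mllm * mean(mllm_scores) + w_perc * perceptual + w_temp * temporal


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: normalize rewards within one group
    of samples generated from the same conditioning input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of three videos sampled for the same audio clip.
group = [
    composite_reward([4, 3, 5], perceptual=0.8, temporal=0.6),
    composite_reward([2, 2, 3], perceptual=0.5, temporal=0.7),
    composite_reward([5, 4, 4], perceptual=0.9, temporal=0.9),
]
advantages = group_relative_advantages(group)
```

Samples that score above the group mean receive a positive advantage and are reinforced, while below-average samples are pushed down. Because the advantage is computed relative to the group, no separate value network is needed.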

Paper Structure

This paper contains 22 sections, 18 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Overview of FlowPortrait. We build upon a pretrained MLLM to enable audio-to-video generation for portrait animation. Using an improved Flow-GRPO objective, we post-train the AR-Flow generator with a composite reward that integrates MLLM-based evaluations with perceptual and temporal consistency terms, leading to consistent improvements in generation quality.
  • Figure 2: Error rates of different generator types. Each bar represents the error rate for a specific generator type, with annotations indicating the number of error cases.
  • Figure 3: Score distribution histograms for the three evaluation aspects: lip-sync quality, facial expressiveness, and motion smoothness. Each histogram shows the frequency of scores assigned by human annotators (blue) and the MAS-MA evaluation system (orange) across the range of possible scores (1 to 5).
  • Figure 4: Cherry-picked failure cases from our SFT model, including blurred textures and hallucinated artifacts. RL post-training effectively mitigates these issues. The black masks are used to anonymize identities.
  • Figure 5: Comparison of reward curves during post-training for various noise levels used in the sampling stage.
  • ...and 3 more figures