FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation

Weiting Tan; Andy T. Liu; Ming Tu; Xinghua Qu; Philipp Koehn; Lu Lu

TL;DR

FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) that assesses lip-sync accuracy, expressiveness, and motion quality, and uses these assessments as part of a composite reward for reinforcement-learning post-training. The post-trained generator consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.

Abstract

Generating realistic talking-head videos remains challenging due to persistent issues such as imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception. We propose FlowPortrait, a reinforcement-learning framework for audio-driven portrait animation built on a multimodal backbone for autoregressive audio-to-video generation. FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality. These signals are combined with perceptual and temporal consistency regularizers to form a stable composite reward, which is used to post-train the generator via Group Relative Policy Optimization (GRPO). Extensive experiments, including both automatic evaluations and human preference studies, demonstrate that FlowPortrait consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.
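As a rough illustration of the reward pipeline the abstract describes, the Python sketch below combines per-aspect MLLM scores with perceptual and temporal consistency terms into one scalar reward, then converts a group of rewards into group-relative advantages. The function names, the weights, and the rescaling of 1-to-5 scores are assumptions made for illustration and are not taken from the paper. Only the normalization step (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
from statistics import mean, pstdev


def composite_reward(mllm_scores, perceptual, temporal,
                     w_mllm=1.0, w_perc=0.5, w_temp=0.5):
    """Weighted sum of reward terms (weights are hypothetical).

    mllm_scores: dict of per-aspect scores on a 1-to-5 scale,
                 e.g. lip sync, expressiveness, motion quality.
    perceptual, temporal: regularizer values, assumed to lie in [0, 1].
    """
    # Average the aspect scores and rescale to [0.2, 1.0] (assumed choice).
    mllm_term = mean(list(mllm_scores.values())) / 5.0
    return w_mllm * mllm_term + w_perc * perceptual + w_temp * temporal


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: three videos sampled for the same audio and reference image.
group = [
    composite_reward({"lip_sync": 4, "expressiveness": 3, "motion": 4}, 0.8, 0.9),
    composite_reward({"lip_sync": 2, "expressiveness": 3, "motion": 2}, 0.6, 0.7),
    composite_reward({"lip_sync": 5, "expressiveness": 4, "motion": 4}, 0.9, 0.9),
]
advantages = group_relative_advantages(group)
```

Samples that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged. Because the advantage is relative to the group, no separate value network is needed.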

Paper Structure (22 sections, 18 equations, 8 figures, 11 tables)

Introduction
Automatic Evaluation for Portrait Animation
Existing Evaluation Metrics
Automatic Evaluation via Multimodal LLMs
Automatic Evaluation's Alignment with Human Judgment
Deep Dive into Multi-Aspect, Multi-Agent Evaluation
Training Methodology
Autoregressive Rectified Flow Model
Reinforcement Learning with Flow-GRPO
Reward System Design
Experiments
Setup
Main Results
Analysis and Ablation Studies
Related Work
...and 7 more sections

Figures (8)

Figure 1: Overview of FlowPortrait. We build upon a pretrained MLLM to enable audio-to-video generation for portrait animation. Using an improved Flow-GRPO objective, we post-train the AR-Flow generator with a composite reward that integrates MLLM-based evaluations with perceptual and temporal consistency terms, leading to consistent improvements in generation quality.
Figure 2: Error rates of different generator types. Each bar represents the error rate for a specific generator type, with annotations indicating the number of error cases.
Figure 3: Score distribution histograms for the three evaluation aspects: lip-sync quality, facial expressiveness, and motion smoothness. Each histogram shows the frequency of scores assigned by human annotators (blue) and by the MAS-MA evaluation system (orange) across the range of possible scores (1 to 5).
Figure 4: Cherry-picked failure cases from our SFT model, including blurred textures and hallucinated artifacts. RL post-training effectively mitigates these issues. The black masks are used to anonymize identities.
Figure 5: Comparison of reward curves during post-training for various noise levels in the sampling stage.
...and 3 more figures
