MegActor-$Σ$: Unlocking Flexible Mixed-Modal Control in Portrait Animation with Diffusion Transformer

Shurong Yang, Huadong Li, Juhao Wu, Minhao Jing, Linze Li, Renhe Ji, Jiajun Liang, Haoqiang Fan, Jin Wang

TL;DR

This work tackles the challenge of flexible mixed-modal control in portrait animation by introducing MegActor-$Σ$, a mixed-modal conditional diffusion transformer (DiT) that fuses visual and audio signals without relying on private data. It implements a three-stage Modality Decoupling Control training regime and an Amplitude Adjustment inference strategy to balance control strengths and freely scale motion amplitudes across modalities. The model is trained on a rigorously filtered public dataset and demonstrates superior performance over prior methods trained on private data, achieving vivid, coherent, and identity-preserving animations with flexible modality combinations. By providing dataset evaluation metrics and a public 313-hour training corpus, the work fosters open research and reproducibility in multimodal portrait animation.

Abstract

Diffusion models have demonstrated superior performance in the field of portrait animation. However, current approaches rely on either the visual or the audio modality alone to control character movements, failing to exploit the potential of mixed-modal control. This limitation arises from the difficulty of balancing the weak control strength of the audio modality against the strong control strength of the visual modality. To address this issue, we introduce MegActor-$Σ$, a mixed-modal conditional diffusion transformer (DiT) that can flexibly inject audio and visual control signals into portrait animation. Specifically, we make substantial advancements over its predecessor, MegActor, by adopting the DiT model structure and integrating audio and visual conditions through advanced modules within the DiT framework. To further enable flexible combinations of mixed-modal control signals, we propose a "Modality Decoupling Control" training strategy to balance the control strength between the visual and audio modalities, along with an "Amplitude Adjustment" inference strategy to freely regulate the motion amplitude of each modality. Finally, to facilitate extensive studies in this field, we design several dataset evaluation metrics to filter public datasets and train MegActor-$Σ$ solely on the filtered data. Extensive experiments demonstrate the superiority of our approach in generating vivid portrait animations, outperforming previous methods trained on private datasets.

Paper Structure (18 sections, 4 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Qualitative results of MegActor-$\Sigma$ in generating high-quality and flexible portrait animations, including: 1) Modality Flexibility, enabling control through visual, audio, or mixed-modal signals; 2) Amplitude Flexibility, enabling adjustment of the scale of head movement and speech amplitude. Moreover, MegActor-$\Sigma$ is trained purely on public datasets and still outperforms previous closed-source methods. Please see our project page for detailed comparisons.
  • Figure 2: Visualization of visual leakage. Even when the mouth-driving components of the visual modality are removed, as V-Express [wang2024v] does, the generated results still exhibit a speaking pattern in the absence of any audio driving signal.
  • Figure 3: Mixed-modal DiT architecture of MegActor-$\Sigma$.
  • Figure 4: The overall framework of the "Modality Decoupling Control" training strategy. First, we use face dropout to control partial signals (e.g., eyes or mouth). Masked spatial attention and a masked MSE loss are then applied to ensure that control of the mouth region is decoupled. Second, we integrate audio for mixed-modal control, using face dropout to dynamically balance the control strength of the audio and visual modalities. Finally, temporal layers are introduced to learn motion priors.
  • Figure 5: Qualitative comparisons. The proposed MegActor-$\Sigma$, driven by original images and audio, achieves accurate facial expression transfer (e.g., consistent head and eye movements) and precise identity resemblance (e.g., facial shape). The reference portraits in rows 1-3 are from VFHQ [xie2022vfhq]; the rest are from the Internet.
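
The masked MSE loss named in the Figure 4 caption can be illustrated with a minimal sketch. It assumes a binary mask that selects which elements contribute to the loss. The function name `masked_mse`, the flat-list inputs, and the choice to return 0 for an empty mask are illustrative assumptions rather than the authors' implementation, and the region the paper actually masks is not specified in this summary.

```python
def masked_mse(pred, target, mask):
    """Mean squared error restricted to the elements where mask is non-zero.

    pred, target, and mask are flat sequences of equal length.
    mask holds 0/1 weights (hypothetical convention: 1 = element counts toward the loss).
    """
    if not (len(pred) == len(target) == len(mask)):
        raise ValueError("pred, target and mask must have the same length")

    weight = sum(mask)
    if weight == 0:
        # Nothing is selected, so this sample contributes no loss (an assumption).
        return 0.0

    # Sum squared errors only over the selected elements, then normalise by
    # the number of selected elements instead of the full length.
    squared_error = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return squared_error / weight


# Usage: the third element is masked out, so its error is ignored.
loss = masked_mse([1.0, 2.0, 3.0], [1.0, 0.0, 0.0], [1, 1, 0])
print(loss)
```

Normalising by the number of selected elements keeps the loss scale comparable across samples whose masks cover different areas. Dividing by the full length instead would shrink the loss for small masked regions.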