MegActor-$Σ$: Unlocking Flexible Mixed-Modal Control in Portrait Animation with Diffusion Transformer
Shurong Yang, Huadong Li, Juhao Wu, Minhao Jing, Linze Li, Renhe Ji, Jiajun Liang, Haoqiang Fan, Jin Wang
TL;DR
This work tackles the challenge of flexible mixed-modal control in portrait animation by introducing MegActor-$Σ$, a mixed-modal conditional diffusion transformer (DiT) that fuses visual and audio signals without relying on private data. It implements a three-stage Modality Decoupling Control training regime and an Amplitude Adjustment inference strategy to balance control strengths and freely scale motion amplitudes across modalities. The model is trained on a rigorously filtered public dataset and demonstrates superior performance over prior methods trained on private data, achieving vivid, coherent, and identity-preserving animations with flexible modality combinations. By providing dataset evaluation metrics and a public 313-hour training corpus, the work fosters open research and reproducibility in multimodal portrait animation.
Abstract
Diffusion models have demonstrated superior performance in the field of portrait animation. However, current approaches rely on either the visual or the audio modality to control character movements, failing to exploit the potential of mixed-modal control. This challenge arises from the difficulty of balancing the weak control strength of the audio modality against the strong control strength of the visual modality. To address this issue, we introduce MegActor-$Σ$: a mixed-modal conditional diffusion transformer (DiT), which can flexibly inject audio and visual control signals into portrait animation. Specifically, we make substantial advancements over its predecessor, MegActor, by leveraging the promising model structure of DiT and integrating audio and visual conditions through advanced modules within the DiT framework. To further achieve flexible combinations of mixed-modal control signals, we propose a ``Modality Decoupling Control'' training strategy to balance the control strength between the visual and audio modalities, along with an ``Amplitude Adjustment'' inference strategy to freely regulate the motion amplitude of each modality. Finally, to facilitate extensive studies in this field, we design several dataset evaluation metrics to filter public datasets and train MegActor-$Σ$ solely on the filtered data. Extensive experiments demonstrate the superiority of our approach in generating vivid portrait animations, outperforming previous methods trained on private datasets.
