DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

Xu Guo; Fulong Ye; Qichao Sun; Liyang Chen; Bingchuan Li; Pengze Zhang; Jiawei Liu; Songtao Zhao; Qian He; Xiangwang Hou

TL;DR

DreamID-Omni is a unified framework for controllable human-centric audio-video generation. It handles reference-based audio-video generation (R2AV), video editing (RV2AV), and audio-driven video animation (RA2V) within a single Symmetric Conditional Diffusion Transformer. To address identity-timbre binding failures and speaker confusion in multi-person scenes, it uses Dual-Level Disentanglement: Synchronized RoPE (Syn-RoPE) at the signal level and Structured Captions at the semantic level. A three-stage Multi-Task Progressive Training curriculum uses weakly-constrained generative priors to regularize strongly-constrained tasks, which prevents overfitting and harmonizes the disparate objectives. The authors report state-of-the-art results across video, audio, and audio-visual consistency on IDBench-Omni, outperforming leading proprietary commercial models, and plan to release their code.

Abstract

Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks, including reference-based audio-video generation (R2AV), video editing (RV2AV), and audio-driven video animation (RA2V), as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE (Syn-RoPE) at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. In addition, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.
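
The abstract does not say how Synchronized RoPE assigns positions, so the following is only a minimal sketch of the standard rotary position embedding (RoPE) property that such a scheme could build on: attention logits depend only on the relative position between a query and a key. The slot assignment (`slots`, with identity k placed at position k * 100), the token names `face` and `voice`, and the helper `attention_logits` are all hypothetical illustrations and are not taken from the paper.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply standard rotary position embedding to x of shape (n, d), d even."""
    half = x.shape[1] // 2
    freqs = base ** (-np.arange(half) / half)       # one frequency per pair
    angles = positions[:, None] * freqs[None, :]    # (n, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def attention_logits(q, q_pos, k, k_pos):
    """Dot-product logits after rotating queries and keys by their positions."""
    return rope(q, q_pos) @ rope(k, k_pos).T

rng = np.random.default_rng(0)
dim = 8
face = rng.normal(size=(2, dim))    # hypothetical: one face-reference token per identity
voice = rng.normal(size=(2, dim))   # hypothetical: one voice-reference token per identity

# ASSUMPTION (not from the paper): identity k's face and voice tokens share
# the same position slot, here k * 100.
slots = np.array([0.0, 100.0])
synced = attention_logits(face, slots, voice, slots)

# Same-slot pairs (the diagonal) sit at relative position 0, so RoPE leaves
# their logits equal to the plain dot product; cross-identity pairs are
# rotated by the slot offset.
plain = face @ voice.T

# Shifting every position by the same amount leaves all logits unchanged.
shifted = attention_logits(face, slots + 7.0, voice, slots + 7.0)
```

Under this assumed assignment, a face token and a voice token of the same identity interact as if no positional rotation were applied, while tokens of different identities are separated by a fixed relative offset. Whether the paper's Syn-RoPE works this way would need to be checked against the Dual-Level Disentanglement section.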

Paper Structure (22 sections, 4 equations, 13 figures, 7 tables)

Introduction
Related Work
Joint Audio-Video Generation
Controllable Video Generation Model
Methodology
Problem Formulation
Framework
Symmetric Conditional DiT
Dual-Level Disentanglement
Multi-Task Progressive Training
Inference Pipeline
Experiments
Setup
Comparison
Ablation Studies
...and 7 more sections

Figures (13)

Figure 1: Showcase of DreamID-Omni. DreamID-Omni seamlessly unifies reference-based audio-video generation (R2AV), video editing (RV2AV), and audio-driven video animation (RA2V).
Figure 2: Overview of DreamID-Omni framework. We integrate reference-based generation (R2AV), editing (RV2AV), and animation (RA2V) using a Symmetric Conditional DiT trained via a multi-task progressive training strategy. Structured Caption and Syn-RoPE ensure robust dual-level disentanglement in multi-person scenarios.
Figure 3: Qualitative comparison with state-of-the-art (SOTA) methods on R2AV. Please zoom in for more details.
Figure 4: Qualitative comparison with SOTA methods on RV2AV. Please zoom in for more details.
Figure 5: Qualitative comparison with SOTA methods on RA2V. Please zoom in for more details.
...and 8 more figures
