Multivariate Diffusion Transformer with Decoupled Attention for High-Fidelity Mask-Text Collaborative Facial Generation

Yushe Cao, Dianxi Shi, Xing Fu, Xuechao Zou, Haikuo Peng, Xueqi Li, Chun Yu, Junliang Xing

TL;DR

This work tackles insufficient cross-modal interaction in high-fidelity mask-text facial synthesis. It proposes MDiTFace, a diffusion-transformer framework with unified tokenization, a tri-stream multivariate transformer block, and a decoupled attention mechanism that separates dynamic and static computations, so that static features can be cached and mask-related overhead drops by over 94%. The method uses LoRA fine-tuning and a flow-matching loss, and achieves state-of-the-art results on MM-CelebA, MM-FFHQ, and MM-FairFace in both fidelity and conditional consistency, while supporting robust unimodal and multimodal generation. The result is more efficient, higher-quality cross-modal fusion for facial generation tasks and downstream applications that require precise mask-text control.

Abstract

While significant progress has been achieved in multimodal facial generation using semantic masks and textual descriptions, conventional feature fusion approaches often fail to enable effective cross-modal interactions, leading to suboptimal generation outcomes. To address this challenge, we introduce MDiTFace, a customized diffusion transformer framework that employs a unified tokenization strategy to process semantic mask and text inputs, eliminating discrepancies between heterogeneous modality representations. The framework facilitates comprehensive multimodal feature interaction through stacked, newly designed multivariate transformer blocks that process all conditions synchronously. Additionally, we design a novel decoupled attention mechanism by dissociating implicit dependencies between mask tokens and temporal embeddings. This mechanism separates internal computations into dynamic and static pathways, so that features computed in the static pathways can be cached and reused after the initial calculation, reducing the additional computational overhead introduced by the mask condition by over 94% while maintaining performance. Extensive experiments demonstrate that MDiTFace significantly outperforms competing methods in terms of both facial fidelity and conditional consistency.
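The caching idea in the abstract can be illustrated with a toy example. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a single attention head, random projection weights, and that the mask-token keys and values do not depend on the denoising timestep. The class name `ToyDecoupledAttention` and all of its attributes are hypothetical, introduced only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ToyDecoupledAttention:
    """Single-head attention with a dynamic and a static pathway.

    Assumption (illustrative only): mask tokens are timestep-independent,
    so their key/value projections can be computed once and reused.
    """

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = dim ** -0.5
        # Dynamic pathway: projections for the noisy image tokens.
        self.wq = rng.normal(size=(dim, dim)) * scale
        self.wk = rng.normal(size=(dim, dim)) * scale
        self.wv = rng.normal(size=(dim, dim)) * scale
        # Static pathway: projections for the mask tokens.
        self.wk_mask = rng.normal(size=(dim, dim)) * scale
        self.wv_mask = rng.normal(size=(dim, dim)) * scale
        self._cache = None
        self.static_calls = 0  # how often the static pathway was computed

    def reset(self):
        """Drop the cache, e.g. when a new mask is supplied."""
        self._cache = None

    def _static_kv(self, mask_tokens):
        # Compute mask keys/values only on the first call, then reuse them.
        if self._cache is None:
            self.static_calls += 1
            self._cache = (mask_tokens @ self.wk_mask,
                           mask_tokens @ self.wv_mask)
        return self._cache

    def __call__(self, image_tokens, mask_tokens):
        # Dynamic pathway: recomputed at every denoising step.
        q = image_tokens @ self.wq
        k_dyn = image_tokens @ self.wk
        v_dyn = image_tokens @ self.wv
        # Static pathway: served from the cache after the first step.
        k_sta, v_sta = self._static_kv(mask_tokens)
        k = np.concatenate([k_dyn, k_sta], axis=0)
        v = np.concatenate([v_dyn, v_sta], axis=0)
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        return weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    attn = ToyDecoupledAttention(dim=8)
    mask = rng.normal(size=(5, 8))       # fixed across all steps
    for _ in range(4):                   # four toy denoising steps
        noisy = rng.normal(size=(6, 8))  # image tokens change every step
        out = attn(noisy, mask)
    print(out.shape, attn.static_calls)
```

In this sketch the mask projections are computed on the first step only, while the image-token projections are recomputed every step. The 94% figure reported in the abstract comes from the paper's own measurements and is not something this toy reproduces.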

Paper Structure

This paper contains 24 sections, 12 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Overall framework of our MDiTFace method.
  • Figure 2: Internal attention design of the multivariate transformer block. (a) The vanilla dual-stream attention in FLUX.1, which exclusively supports text-modal conditioning. (b) Extended holistic tri-stream attention, which supports mask-text multimodal conditions at significantly increased computational cost. (c) Hard-decoupled attention with dynamic and static pathways, which improves efficiency at the cost of performance degradation. (d) Improved decoupled attention, which restores mask-to-text perceptual pathways to balance efficiency and model performance.
  • Figure 3: Qualitative comparison with state-of-the-art methods of mask-text collaborative facial generation.
  • Figure 4: User study. Our method secured the highest proportion of user support.
  • Figure 5: Additional computational overhead introduced by the mask condition.
  • ...and 7 more figures