MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation

Bharath Krishnamurthy, Ajita Rattani

Abstract

Recent multimodal face generation models address the spatial-control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches extend pre-trained text-to-image pipelines by appending auxiliary control modules or by stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared attention mechanism with Rotary Position Embeddings (RoPE). This design prevents either modality from dominating and ensures strong adherence to both text and structural priors, achieving high spatial-semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder enables a single cohesive model to adapt dynamically to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling. The code and dataset are available on our project page: https://vcbsl.github.io/MMFace-DiT/
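
To make the dual-stream design concrete, here is a minimal, hypothetical PyTorch sketch of such a block. It is not the authors' implementation: the dimensions and module names are assumptions, a plain `nn.MultiheadAttention` stands in for the shared RoPE attention (the rotary rotation of queries and keys is omitted), and the full block would modulate and gate the MLP path as well. Only the overall structure, per-stream AdaLN modulation, joint attention over the concatenated token sequence, and gated residuals, follows the paper's description.

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Per-stream adaptive LayerNorm: shift/scale/gate derived from C_global (assumed form)."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.proj = nn.Linear(cond_dim, 3 * dim)  # -> shift, scale, gate

    def forward(self, x, c):
        shift, scale, gate = self.proj(c).unsqueeze(1).chunk(3, dim=-1)
        return self.norm(x) * (1 + scale) + shift, gate

class DualStreamBlock(nn.Module):
    def __init__(self, dim=512, cond_dim=512, heads=8):
        super().__init__()
        self.img_adaln = AdaLN(dim, cond_dim)
        self.txt_adaln = AdaLN(dim, cond_dim)
        # Stand-in for the shared RoPE attention; the rotary embedding is omitted here.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.txt_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img, txt, c):
        i, gi = self.img_adaln(img, c)        # modulate each stream by C_global
        t, gt = self.txt_adaln(txt, c)
        seq = torch.cat([i, t], dim=1)        # one joint sequence: both modalities
        fused, _ = self.attn(seq, seq, seq, need_weights=False)
        i, t = fused[:, :img.size(1)], fused[:, img.size(1):]
        img = img + gi * i                    # gated residual, image stream
        txt = txt + gt * t                    # gated residual, text stream
        return img + self.img_mlp(img), txt + self.txt_mlp(txt)

# Example: 256 image tokens and 77 text tokens, batch of 2.
block = DualStreamBlock()
img, txt = block(torch.randn(2, 256, 512), torch.randn(2, 77, 512), torch.randn(2, 512))
```

The key design point, per the abstract, is that attention runs over the concatenated sequence: neither stream merely cross-attends into the other, and both modalities are queried, keyed, and valued symmetrically, which is what counteracts one modality dominating the other.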

Paper Structure

This paper contains 47 sections, 5 equations, 13 figures, 5 tables, and 4 algorithms.

Figures (13)

  • Figure 1: Demonstration of Disentangled Fine-Grained Attribute Control. Our MMFace-DiT exhibits exceptional disentangled control over the synthesis process. Each row is generated from a single, fixed segmentation mask, where we systematically vary a single keyword in the text prompt. The model accurately synthesizes diverse attributes—including color (hats, hair), expression (smiling, sad), gender, and even semantic concepts like background details, showcasing our model's advanced capability for precise, text-guided semantic generation.
  • Figure 2: Overview of the MMFace-DiT Generation Pipeline. Our model operates in a VAE's latent space. During a forward pass, a noisy latent image is converted into a sequence of image tokens. Concurrently, a text prompt is encoded into text tokens by a CLIP encoder, which also produces pooled embeddings for global conditioning. A global conditioning vector, $C_{\text{global}}$, is formed by combining embeddings from the timestep, the text caption, and our novel Modality Embedder, which processes a flag indicating the spatial condition type. The image tokens, text tokens, and $C_{\text{global}}$ are then processed by our core transformer block, which predicts either the noise $\epsilon$ (DDPM) or the velocity $v$ (RFM). The final image is produced by unpatchifying the output tokens and decoding them using the VAE. A minimal code sketch of this conditioning step follows this figure list.
  • Figure 3: Disentangled Attribute Control via Sketch-Conditioned Generation. MMFace-DiT exhibits fine-grained disentanglement in multimodal synthesis when guided by sketch-based spatial priors. Each row is generated from a single fixed sketch while systematically varying a single textual attribute (e.g., hair color, shirt color, eye color). The model precisely follows the specified text-based edits while preserving identity, expression, and geometric consistency dictated by the sketch. This demonstrates MMFace-DiT’s capability for precise semantic integration with strong geometric priors.
  • Figure 4: Architecture of the MMFace-DiT Block. The block processes image and text tokens in parallel, modulated by a global conditioning vector ($C_{\text{global}}$) via AdaLN. A shared RoPE Attention layer acts as the central fusion mechanism for deep cross-modal interaction. Following attention and MLP operations, each stream is processed and controlled by a gated residual connection. (A simplified code sketch of this block appears after the abstract above.)
  • Figure 5: Mask-Conditioned Synthesis with Diffusion and Flow Paradigms. This figure showcases the high-quality performance of our MMFace-DiT model when trained under both Diffusion and Rectified Flow (Flow) objectives. Each example is generated from an identical segmentation mask (far left) and text prompt (far right), demonstrating the model's robust ability to synthesize diverse and realistic portraits that align with both spatial and semantic guidance. Both training paradigms yield excellent results, successfully interpreting complex attributes like hair style, expression, and accessories. Notably, the Flow-based model often exhibits a particularly refined level of photorealism, producing images with remarkably consistent lighting, skin texture, and fine-grained detail. The two training objectives are sketched in code after this figure list.
  • ...and 8 more figures
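
The pipeline caption above specifies that the global conditioning vector $C_{\text{global}}$ combines a timestep embedding, the pooled CLIP text embedding, and the output of the Modality Embedder applied to a flag naming the spatial condition type. The sketch below is one plausible reading of that description, not the paper's code: the sinusoidal timestep embedding, summation as the combination rule, the 768-dimensional pooled embedding, and the two-entry {mask, sketch} flag set are all assumptions.

```python
import math
import torch
import torch.nn as nn

class GlobalConditioner(nn.Module):
    MODALITIES = {"mask": 0, "sketch": 1}  # spatial-condition flags (assumed set)

    def __init__(self, dim=512, pooled_text_dim=768):
        super().__init__()
        self.text_proj = nn.Linear(pooled_text_dim, dim)             # pooled CLIP embedding
        self.modality_emb = nn.Embedding(len(self.MODALITIES), dim)  # Modality Embedder
        self.dim = dim

    def timestep_embedding(self, t):
        # Standard sinusoidal embedding, as in DDPM-style transformers.
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
        args = t.float()[:, None] * freqs[None, :]
        return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

    def forward(self, t, pooled_text, modality):
        flag = torch.full_like(t, self.MODALITIES[modality])
        return (self.timestep_embedding(t)
                + self.text_proj(pooled_text)
                + self.modality_emb(flag))

# Example: batch of 4 timesteps, conditioned on a sketch.
cond = GlobalConditioner()
c_global = cond(torch.randint(0, 1000, (4,)), torch.randn(4, 768), "sketch")  # (4, 512)
```

Because the modality flag is just another additive embedding, switching between mask- and sketch-conditioned generation at inference needs only a different flag value, consistent with the abstract's claim that a single model adapts to varying spatial conditions without retraining.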
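The pipeline caption also notes that the transformer predicts either the noise $\epsilon$ (DDPM) or the velocity $v$ (RFM), and the mask-conditioned synthesis figure compares models trained under both paradigms. For reference, the standard forms of the two objectives on latent images of shape (B, C, H, W) are below. These are textbook definitions rather than details from the paper, and `model` with its `(x_t, t, **cond)` signature is hypothetical.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alphas_cumprod, **cond):
    """Epsilon-prediction: corrupt x0 along the DDPM schedule, regress the noise."""
    t = torch.randint(0, len(alphas_cumprod), (x0.size(0),), device=x0.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps       # forward diffusion
    return F.mse_loss(model(x_t, t, **cond), eps)

def rectified_flow_loss(model, x0, **cond):
    """Velocity prediction: straight-line path from data (t=0) to noise (t=1)."""
    t = torch.rand(x0.size(0), device=x0.device).view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * eps                     # linear interpolation path
    v = eps - x0                                     # constant target velocity
    return F.mse_loss(model(x_t, t.flatten(), **cond), v)
```

Only the regression target changes between the two losses, which is why, per the figure caption, the same MMFace-DiT architecture can be trained under either paradigm.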