HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation

Xiang Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yuqian Zhou, Qing Liu, Shiwei Zhang, Yijun Li, Shaoteng Liu, Haitian Zheng, Jason Kuen, Yuehuan Wang, Changxin Gao, Nong Sang

TL;DR

HBridge tackles the limitations of symmetric Mixture-of-Transformers in unified multimodal understanding and generation by introducing asymmetric heterogeneity, a mid-layer semantic bridge, and semantic reconstruction tokens. The approach pairs a frozen pretrained understanding expert with a diffusion-based generative expert, connects them through a selective mid-layer bridge, and uses 16 Semantic Reconstruction Tokens to ground generation in visual semantics; this reduces attention sharing by over 40% and leverages pretrained priors more efficiently. Empirical results on DPG-Bench, GenEval, and ImgEdit-Bench show state-of-the-art performance with far fewer training tokens (~200B) than prior dense MoT methods (~2.5T), validating a new paradigm for efficient, unified multimodal generation. The work demonstrates that asymmetric architectures with targeted cross-modal interactions can achieve superior perception and synthesis while lowering computational cost, enabling practical deployment of unified multimodal systems.

Abstract

Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm and adopt a symmetric design that mirrors one expert onto the other for convenient initialization and fusion. This design remains suboptimal because of inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that connect all layers between experts via shared attention, HBridge selectively bridges only the intermediate layers, reducing attention sharing by over 40%, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.
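
The selective bridging described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `BridgeSchedule` class, the assumption that the two experts are paired layer by layer, and the specific values (28 layers, 6 decoupled at each end) are all hypothetical, chosen only so that the share of decoupled layers lands above 40%.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BridgeSchedule:
    """Hypothetical helper: which paired layers share attention between experts."""

    num_layers: int    # assumed number of paired layers (not from the paper)
    skip_shallow: int  # assumed count of decoupled layers at the bottom
    skip_deep: int     # assumed count of decoupled layers at the top

    def __post_init__(self) -> None:
        # Reject schedules that leave no room for a bridge.
        if min(self.skip_shallow, self.skip_deep) < 0:
            raise ValueError("skip counts must be non-negative")
        if self.skip_shallow + self.skip_deep > self.num_layers:
            raise ValueError("cannot skip more layers than exist")

    def is_bridged(self, layer: int) -> bool:
        """True if this layer shares attention across the two experts."""
        return self.skip_shallow <= layer < self.num_layers - self.skip_deep

    def bridged_layers(self) -> list[int]:
        """Indices of the mid layers that form the crossbar of the 'H'."""
        return [i for i in range(self.num_layers) if self.is_bridged(i)]

    def sharing_reduction(self) -> float:
        """Fraction of layers decoupled relative to dense (all-layer) fusion."""
        return (self.skip_shallow + self.skip_deep) / self.num_layers


# Illustrative values only; the paper's actual layer counts are not given here.
schedule = BridgeSchedule(num_layers=28, skip_shallow=6, skip_deep=6)

# '|' marks a decoupled layer, 'H' marks a bridged layer.
layout = "".join(
    "H" if schedule.is_bridged(i) else "|" for i in range(schedule.num_layers)
)
print(layout)
print(f"attention sharing reduced by {schedule.sharing_reduction():.1%}")
```

Under these assumed values, 12 of the 28 paired layers are decoupled, which is how a figure such as "over 40%" could arise from skipping shallow and deep layers. Counting layers is only a proxy: the actual saving depends on sequence lengths and on how the shared attention is implemented.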

Paper Structure

This paper contains 14 sections, 3 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Image generation and editing samples from HBridge, which achieves high-quality and photorealistic results.
  • Figure 3: Overview of the proposed HBridge. We pair an arbitrary pretrained understanding expert with a pretrained generative expert and connect their mid-layers through self-attention. In practice, the understanding expert is typically a VLM, while the generative expert is often a DiT variant. QKV-Linear modules are applied to align their feature dimensions. Additionally, we introduce learnable semantic tokens that explicitly reconstruct visual semantic tokens of the target image, improving text alignment and enhancing generation quality.
  • Figure 4: Ablation study on different initialization schemes. The baseline initializes both the understanding and generative experts from Qwen2.5-VL-7B, while HBridge uses the pretrained diffusion initialization from OmniGen2 (wu2025omnigen2).
  • Figure 5: Ablation study on varying the number of skipped layers.
  • Figure 6: Analysis of varying the skipped layer on DPG-Bench.
  • ...and 6 more figures