HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation
Xiang Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yuqian Zhou, Qing Liu, Shiwei Zhang, Yijun Li, Shaoteng Liu, Haitian Zheng, Jason Kuen, Yuehuan Wang, Changxin Gao, Nong Sang
TL;DR
HBridge addresses the limitations of symmetric Mixture-of-Transformers (MoT) designs for unified multimodal understanding and generation with three components: asymmetric heterogeneous experts, a mid-layer semantic bridge, and semantic reconstruction tokens. The approach pairs a frozen pretrained understanding expert with a diffusion-based generative expert, connects them only through a selective mid-layer bridge, and uses 16 Semantic Reconstruction Tokens to ground generation in visual semantics. Selective bridging reduces attention sharing by over 40% and makes more efficient use of pretrained priors. On DPG-Bench, GenEval, and ImgEdit-Bench, HBridge achieves state-of-the-art performance with far fewer training tokens (~200B) than prior dense MoT methods (~2.5T). The work shows that asymmetric architectures with targeted cross-modal interaction can deliver strong perception and synthesis at lower computational cost, which makes unified multimodal systems more practical to train and deploy.
Abstract
Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm and adopt a symmetric design in which one expert mirrors the other for convenient initialization and fusion. This design remains suboptimal because of inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that connect all layers between experts via shared attention, HBridge selectively bridges only intermediate layers, reducing attention sharing by over 40%, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.
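
The selective bridging described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that attention sharing is counted per layer, that both experts have the same depth, and that causal masking can be ignored for simplicity. The layer count (28) and bridged range (layers 6 to 21) are hypothetical values chosen only so that the reduction exceeds 40%, and the function names are illustrative.

```python
def bridged_layers(num_layers, first_bridged, last_bridged):
    """Return per-layer flags: True where the two experts share attention."""
    if not (0 <= first_bridged <= last_bridged < num_layers):
        raise ValueError("bridge range must lie inside the layer stack")
    return [first_bridged <= i <= last_bridged for i in range(num_layers)]


def sharing_reduction(flags):
    """Fraction of layers that no longer share attention, relative to
    dense fusion where every layer is shared."""
    return 1.0 - sum(flags) / len(flags)


def attention_mask(n_und, n_gen, bridged):
    """Block mask over the sequence [understanding tokens | generation tokens].

    mask[q][k] is True when query token q may attend to key token k.
    In a bridged layer every token sees every token; in a decoupled layer
    each token only sees tokens handled by its own expert.
    Causal masking is deliberately omitted (simplifying assumption).
    """
    n = n_und + n_gen
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            same_expert = (q < n_und) == (k < n_und)
            row.append(same_expert or bridged)
        mask.append(row)
    return mask


# Hypothetical configuration: 28 layers, only layers 6..21 are bridged.
flags = bridged_layers(num_layers=28, first_bridged=6, last_bridged=21)
print(f"{sharing_reduction(flags):.1%}")  # 42.9%

# Decoupled (shallow or deep) layer: attention is block-diagonal.
for row in attention_mask(n_und=2, n_gen=3, bridged=False):
    print(["x" if allowed else "." for allowed in row])
```

In this sketch the shallow and deep layers form the two vertical strokes of the "H" (each expert attends only within itself), and the bridged middle layers form the crossbar. Counting sharing per layer is only one possible reading of the 40% figure; the paper may measure it differently.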
