Growing Visual Generative Capacity for Pre-Trained MLLMs

Hanyu Wang, Jiaming Han, Ziyan Yang, Qi Zhao, Shanchuan Lin, Xiangyu Yue, Abhinav Shrivastava, Zhenheng Yang, Hao Chen

TL;DR

Bridge tackles the challenge of building a unified multimodal LLM that can both understand and generate images without abandoning the autoregressive paradigm. It extends a pre-trained visual understanding backbone with a Mixture-of-Transformers that pairs a frozen understanding expert with a trainable generation expert, so that text and discrete image tokens are modeled under a single next-token prediction objective. A semantic-to-pixel discrete representation combines 81 high-level semantic tokens with 1024 pixel tokens, improving language alignment and visual fidelity with only a 7.9% increase in token length. Across diverse benchmarks, Bridge matches or surpasses prior unified MLLMs on understanding and generation while requiring less data and training time, pointing to a more efficient path to joint visual reasoning and creation.

Abstract

Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models remains challenging: hybrid approaches combine continuous embeddings with diffusion or flow-based objectives, producing high-quality images but breaking the autoregressive paradigm, while pure autoregressive approaches unify text and image prediction over discrete visual tokens but often face trade-offs between semantic alignment and pixel-level fidelity. In this work, we present Bridge, a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability through a Mixture-of-Transformers architecture, enabling both image understanding and generation within a single next-token prediction framework. To further improve visual generation fidelity, we propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens, achieving strong language alignment and precise description of visual details with only a 7.9% increase in sequence length. Extensive experiments across diverse multimodal benchmarks demonstrate that Bridge achieves competitive or superior results in both understanding and generation benchmarks, while requiring less training data and reduced training time compared to prior unified MLLMs.
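The 7.9% figure can be checked against the token counts quoted in the TL;DR. The short Python sketch below assumes the increase is measured relative to a pixel-only sequence of 1024 tokens. That baseline is an inference from the numbers, not something the abstract states, and the function and variable names are illustrative only.

```python
# Token counts quoted in the TL;DR. Treating the 1024 pixel tokens as the
# baseline is an assumption made here to reproduce the 7.9% figure.
SEMANTIC_TOKENS = 81
PIXEL_TOKENS = 1024


def relative_increase(extra: int, baseline: int) -> float:
    """Return the fractional growth in sequence length from adding `extra` tokens."""
    return extra / baseline


total_tokens = SEMANTIC_TOKENS + PIXEL_TOKENS
increase = relative_increase(SEMANTIC_TOKENS, PIXEL_TOKENS)

# Prints: 1105 tokens per image, 7.9% longer than pixel-only
print(f"{total_tokens} tokens per image, {increase:.1%} longer than pixel-only")
```

Under this reading, each generated image costs 1105 visual tokens, and the semantic prefix accounts for the entire overhead.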

Paper Structure

This paper contains 29 sections, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Qualitative results on text-to-image generation and image editing tasks.
  • Figure 2: Method overview. Bridge adopts a Mixture-of-Transformers (MoT) architecture with two experts: a frozen understanding (Und.) expert for text and visual understanding tokens, and a newly trained generation (Gen.) expert for visual generation tokens. Both experts share unified causal attention across all tokens. Visual generation representations are constructed by concatenating short semantic token sequences with longer pixel token sequences, which are modeled jointly with text tokens under a unified next-token prediction objective. Semantic tokens serve as a bridge between the text and pixel modalities, substantially improving visual generation quality.
  • Figure 3: More visualizations of our model on text-to-image generation.
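
The token layout and expert routing described in the Figure 2 caption can be illustrated with a small pure-Python sketch. It is a toy under stated assumptions, not the authors' implementation: the token-type labels, the function names, and the boolean-list mask are all hypothetical, and real experts would be transformer weights rather than string tags. The default counts of 81 and 1024 are taken from the TL;DR.

```python
from typing import List

# Hypothetical token-type labels; these are not identifiers from the paper.
TEXT = "text"
UND_VISUAL = "und_visual"   # visual tokens consumed for understanding
SEMANTIC = "semantic"       # short semantic prefix of a generated image
PIXEL = "pixel"             # longer pixel-level token sequence


def build_generation_sequence(
    num_text: int, num_semantic: int = 81, num_pixel: int = 1024
) -> List[str]:
    """Lay out a text prompt, then semantic tokens, then pixel tokens."""
    return [TEXT] * num_text + [SEMANTIC] * num_semantic + [PIXEL] * num_pixel


def route_expert(token_type: str) -> str:
    """Send understanding-side tokens to the frozen expert, the rest to the trained one."""
    return "und" if token_type in (TEXT, UND_VISUAL) else "gen"


def causal_mask(length: int) -> List[List[bool]]:
    """Unified causal attention: position i sees every j <= i, whichever expert owns it."""
    return [[j <= i for j in range(length)] for i in range(length)]


# Small sizes keep the example readable.
sequence = build_generation_sequence(num_text=4, num_semantic=3, num_pixel=5)
experts = [route_expert(token_type) for token_type in sequence]
mask = causal_mask(len(sequence))

print(experts)
```

In this sketch the routing decides only which expert processes a token, while the mask is shared, so a pixel token handled by the generation expert can still attend to the text prompt handled by the understanding expert.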