Unified Multimodal Model as Auto-Encoder

Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan

TL;DR

The paper argues that true unification of multimodal understanding and generation requires a foundational objective rather than merely coupling tasks. By recasting understanding as an encoder and generation as a decoder, it introduces UAE, which pre-trains a long-context decoder on 700k detailed image captions and then uses a two-stage reinforcement-learning procedure (Unified-GRPO) to optimize a unified reconstruction objective. A dedicated Unified-Bench benchmark quantifies bidirectional benefits and demonstrates that understanding can enhance generation and vice versa, though gains are bounded by current generation-model limitations. Collectively, these findings advance toward genuinely unified multimodal intelligence and provide practical evaluation tools and data for the community.

Abstract

The pursuit of unified multimodal models (UMMs) has long been hindered by a fundamental schism between multimodal understanding and generation. Current approaches typically disentangle the two and treat them as separate endeavors with disjoint objectives, missing the mutual benefits. We argue that true unification requires more than just merging two tasks. It requires a unified, foundational objective that intrinsically links them. In this paper, we introduce an insightful paradigm through the Auto-Encoder lens, i.e., regarding understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. To implement this, we propose UAE, where we begin by pre-training the decoder with the proposed 700k long-context image-caption pairs to direct it to "understand" the fine-grained and complex semantics from the text. We then propose Unified-GRPO via reinforcement learning (RL) to unify the two, which covers two complementary stages: (1) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual perception; (2) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. Our empirical results suggest that understanding can largely enhance generation (verified on GenEval), while generation, in turn, notably strengthens fine-grained visual perception like small object and color recognition (verified on MMT-Bench). This bidirectional improvement reveals a deep synergy: under the unified reconstruction objective, generation and understanding can mutually benefit each other, moving closer to truly unified multimodal intelligence.
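The reconstruction objective described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the reward is the cosine similarity between an embedding of the original image and an embedding of the image reconstructed from a caption, and it uses the group-normalized advantage that GRPO-style methods compute over several sampled outputs. The names `reconstruction_rewards`, `group_relative_advantages`, and `decode_and_embed` are hypothetical, and the embeddings are plain lists of floats standing in for a real image encoder.

```python
import math
from statistics import mean, pstdev
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reconstruction_rewards(
    image_embedding: Vector,
    captions: Sequence[str],
    decode_and_embed: Callable[[str], Vector],
) -> List[float]:
    """Score each sampled caption by how well its reconstruction matches the image.

    `decode_and_embed` is a hypothetical stand-in for "generate an image from
    the caption (T2I decoder), then embed it with an image encoder".
    """
    return [
        cosine_similarity(image_embedding, decode_and_embed(caption))
        for caption in captions
    ]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy usage: three candidate captions for one image, with made-up embeddings.
toy_reconstructions = {
    "a red cube on a table": [1.0, 0.0],
    "a dog": [0.0, 1.0],
    "a cube": [1.0, 1.0],
}
rewards = reconstruction_rewards(
    [1.0, 0.0], list(toy_reconstructions), lambda c: toy_reconstructions[c]
)
advantages = group_relative_advantages(rewards)
```

In this reading, the "Generation for Understanding" stage would use such advantages to update the captioning model, and the "Understanding for Generation" stage would use the same similarity signal to update the image generator. Which module is frozen in each stage, and which similarity model is used, are not specified in this excerpt.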

Paper Structure

This paper contains 32 sections, 3 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: Illustration of the key insight of our UAE, an Auto-Encoder inspired design, for unified multimodal understanding and generation. We treat the understanding model as the encoder and the generation model as the decoder. Using the reconstruction similarity as the unified score, we use RL to maximize it (Unified-GRPO) and utilize it to evaluate the degree of unification (Unified-Bench).
  • Figure 2: The overall workflow of our UAE, consisting of three stages: long-context pre-training (stage-1), generation for understanding (stage-2), and understanding for generation (stage-3). We name our post-training method Unified-GRPO (the last two RL stages), which uses a single, unified reconstruction objective for optimization.
  • Figure 3: Qualitative results on complex, long-context generation. Compared with the baseline, our method recovers far more of the detailed semantics in a highly descriptive input caption, demonstrating that improved understanding can notably benefit generation.
  • Figure 4: Reconstruction results vs. RL training steps. As the number of RL steps increases, the understanding model (encoder) produces longer, more detailed, yet still accurate captions from which the original image can be reconstructed comprehensively, while the generation model (decoder) makes better use of the detailed caption as input. See appendix for more examples.
  • Figure 5: Distribution of our proposed 700k long-context dataset.
  • ...and 10 more figures