Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin, Bin Lin, Zongjian Li, Bin Zhu, Weihao Yu, Li Yuan

TL;DR

The paper investigates whether understanding truly informs generation in unified multimodal models using UniSandbox, a decoupled evaluation framework with synthetic, leak-free data that isolates Knowledge and Reasoning. It finds a significant understanding-generation gap. Explicit Chain-of-Thought (CoT) bridges this gap in reasoning-based generation, and self-training (STARS) can internalize the capability so that the model reasons implicitly during generation. For knowledge transfer, CoT acts as an effective activator for retrieving newly injected knowledge, and query-based architectures show latent CoT-like properties. The findings offer design guidance for future unified multimodal models, suggesting curriculum learning and query-based strategies to better couple understanding with generation while mitigating data leakage and overfitting.

Abstract

Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox

Paper Structure

This paper contains 28 sections, 5 figures, 13 tables.

Figures (5)

  • Figure 1: Data examples for reasoning generation. All images are generated by BAGEL. Normal and CoT represent generation without/with think mode (Chain-of-Thought mode), respectively. We also show the corresponding prompts.
  • Figure 2: Overview of the STARS framework. It illustrates the three sequential stages: (I) Data Generation, where CoT is leveraged to create reasoning-generation pairs; (II) Sample Filtering, which uses the understanding module of UMMs to curate high-quality data; and (III) Fine-tuning, where the unified model is trained with the filtered data to distill CoT reasoning into its standard generation process.
  • Figure 3: Framework for Knowledge Transfer evaluation. The framework first injects novel knowledge (Virtual Character Profiles, left) into the Unified Model's understanding module via fine-tuning. We then evaluate the model's ability to utilize this new knowledge through two distinct generative tasks: Forward Retrieval (Key $\rightarrow$ Value), which requires generating an attribute from a name, and Inverse Search (Value $\rightarrow$ Key), which requires identifying and generating a character based on their attributes (right).
  • Figure 4: We visualize the total probability of relevant words corresponding to different queries. The "Last Text Token" entry, serving as a baseline, presents the probability of the last text token from the MLLM before the query. For clarity, only queries with probabilities exceeding 0.01 are displayed.
  • Figure 5: The average results of BAGEL (normal) on Math for the ablation of rejection sampling.
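
The three STARS stages described in the Figure 2 caption can be sketched as a single training round. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Sample`, `stars_round`, `generate_with_cot`, `judge`, and `fine_tune` are hypothetical, the toy prompts are invented, and the real model calls are replaced by stand-in functions. The only points taken from the caption are the stage order and that each kept image is paired with the original prompt, so that fine-tuning targets the standard (non-CoT) generation path.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Sample:
    prompt: str  # original prompt, without any reasoning trace (hypothetical field)
    image: Any   # image produced in CoT ("think") mode (hypothetical field)


def stars_round(
    prompts: List[str],
    generate_with_cot: Callable[[str], Any],
    judge: Callable[[str, Any], bool],
    fine_tune: Callable[[List[Sample]], None],
) -> List[Sample]:
    """One round of the three stages; all callables are assumed interfaces."""
    # Stage I, Data Generation: create images with CoT enabled.
    candidates = [Sample(p, generate_with_cot(p)) for p in prompts]
    # Stage II, Sample Filtering: keep only samples the judge accepts
    # (in the caption, the judge is the model's own understanding module).
    kept = [s for s in candidates if judge(s.prompt, s.image)]
    # Stage III, Fine-tuning: train on (prompt, image) pairs without the CoT text.
    fine_tune(kept)
    return kept


# Toy stand-ins so the sketch runs without any model.
def toy_generate(prompt: str) -> str:
    return f"<image for: {prompt}>"


def toy_judge(prompt: str, image: str) -> bool:
    # Pretend only the first prompt was rendered correctly.
    return "2 + 3" in prompt


trained_on: List[Sample] = []
kept = stars_round(
    ["draw 2 + 3 apples", "draw 7 - 4 cats"],
    toy_generate,
    toy_judge,
    trained_on.extend,  # stand-in for a real fine-tuning step
)
print(len(kept), kept[0].prompt)
```

Passing the stages in as callables keeps the sketch independent of any particular model API; in practice each would wrap a call into the unified model.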