UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models

Cheng Yang, Chufan Shi, Bo Shui, Yaokang Wu, Muzi Tao, Huijuan Wang, Ivan Yee Lee, Yong Liu, Xuezhe Ma, Taylor Berg-Kirkpatrick

TL;DR

This work examines how explicit reasoning traces influence pixel-level image generation in unified multimodal models. It introduces UReason, a diagnostic benchmark with 2,000 instances across five reasoning tasks, and a controlled evaluation framework that compares direct generation, reasoning-guided generation, and de-contextualized generation. Across eight open-source unified models, it reveals a Reasoning Paradox: reasoning helps with planning, but verbose intermediate thoughts often interfere with execution, whereas conditioning only on the refined prompt yields stronger results. The study argues that the bottleneck lies in contextual interference rather than reasoning capacity, and it offers a principled testbed for developing methods that integrate reasoning with visual generation while mitigating this interference.

Abstract

To elicit capabilities for addressing complex and implicit visual requirements, recent unified multimodal models increasingly adopt chain-of-thought reasoning to guide image generation. However, the actual effect of reasoning on visual synthesis remains unclear. We present UReason, a diagnostic benchmark for reasoning-driven image generation that evaluates whether reasoning can be faithfully executed in pixels. UReason contains 2,000 instances across five task families: Code, Arithmetic, Spatial, Attribute, and Text reasoning. To isolate the role of reasoning traces, we introduce an evaluation framework comparing direct generation, reasoning-guided generation, and de-contextualized generation, which conditions only on the refined prompt. Across eight open-source unified models, we observe a consistent Reasoning Paradox: reasoning traces generally improve performance over direct generation, yet retaining intermediate thoughts as conditioning context often hinders visual synthesis, and conditioning only on the refined prompt yields substantial gains. Our analysis suggests that the bottleneck lies in contextual interference rather than insufficient reasoning capacity. UReason provides a principled testbed for studying reasoning in unified models and motivates future methods that effectively integrate reasoning for visual generation while mitigating interference.
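
The three evaluation settings differ only in what text the image generator is conditioned on. The following is a minimal sketch of that distinction, not the authors' implementation: the names `ReasoningOutput` and `build_context`, the setting labels, and the choice to join prompt, thoughts, and refined prompt with newlines in the reasoning-guided setting are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReasoningOutput:
    """Hypothetical container for a model's reasoning result."""
    thoughts: str        # intermediate reasoning trace
    refined_prompt: str  # explicit prompt produced at the end of reasoning


def build_context(setting: str, prompt: str,
                  reasoning: Optional[ReasoningOutput] = None) -> str:
    """Return the conditioning text for one of the three settings.

    Setting names and the concatenation format are assumptions,
    not taken from the paper.
    """
    if setting == "direct":
        # Direct generation: condition on the original prompt only.
        return prompt
    if reasoning is None:
        raise ValueError(f"setting {setting!r} requires a reasoning output")
    if setting == "reasoning_guided":
        # Reasoning-guided: intermediate thoughts stay in the context.
        return "\n".join([prompt, reasoning.thoughts, reasoning.refined_prompt])
    if setting == "decontextualized":
        # De-contextualized: condition only on the refined prompt.
        return reasoning.refined_prompt
    raise ValueError(f"unknown setting: {setting!r}")


if __name__ == "__main__":
    # Made-up example instance, for illustration only.
    r = ReasoningOutput(
        thoughts="2 + 3 = 5, so the image needs five apples.",
        refined_prompt="Five red apples on a wooden table.",
    )
    for s in ("direct", "reasoning_guided", "decontextualized"):
        print(s, "->", repr(build_context(s, "Draw 2 + 3 apples.", r)))
```

Keeping the settings behind one function makes it explicit that the generator and the prompt are held fixed, so any performance gap between the reasoning-guided and de-contextualized settings can be attributed to the retained intermediate thoughts.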

Paper Structure (63 sections, 5 equations, 20 figures, 6 tables)

Figures (20)

  • Figure 1: Representative UReason instances covering Code, Arithmetic, Spatial, Attribute, and Text reasoning. Prompts specify implicit targets that must be derived via reasoning, enabling diagnostic evaluation of whether reasoning traces remain executable during image synthesis. Detailed task descriptions are listed in Appx. \ref{app:data_curation}.
  • Figure 2: Overview of the UReason evaluation framework. UReason compares $3$ settings: (1) Direct Generation, (2) Reasoning-Guided Generation, and (3) De-contextualized Generation.
  • Figure 3: Comparison of internal (Bagel) and external (Qwen) prompt rewriting across $5$ tasks.
  • Figure 4: Taxonomy of UReason tasks. The benchmark contains $5$ task categories with $30$ fine-grained subcategories covering diverse reasoning and visual generation challenges.
  • Figure 5: Screenshot of the interface used for human evaluation.
  • ...and 15 more figures