UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models
Cheng Yang, Chufan Shi, Bo Shui, Yaokang Wu, Muzi Tao, Huijuan Wang, Ivan Yee Lee, Yong Liu, Xuezhe Ma, Taylor Berg-Kirkpatrick
TL;DR
This work examines how explicit reasoning traces influence pixel-level image generation in unified multimodal models. It introduces UReason, a diagnostic benchmark with 2,000 instances across five reasoning task families, together with a controlled evaluation framework that compares direct generation, reasoning-guided generation, and de-contextualized generation. Across eight open-source unified models, it reveals a Reasoning Paradox: reasoning helps with planning, but verbose intermediate thoughts kept as context often interfere with execution, whereas conditioning only on the refined prompt yields stronger results. The study argues that the bottleneck is contextual interference rather than reasoning capacity, and it offers a principled testbed for developing methods that integrate reasoning with visual generation while mitigating this interference.
Abstract
To elicit capabilities for addressing complex and implicit visual requirements, recent unified multimodal models increasingly adopt chain-of-thought reasoning to guide image generation. However, the actual effect of reasoning on visual synthesis remains unclear. We present UReason, a diagnostic benchmark for reasoning-driven image generation that evaluates whether reasoning can be faithfully executed in pixels. UReason contains 2,000 instances across five task families: Code, Arithmetic, Spatial, Attribute, and Text reasoning. To isolate the role of reasoning traces, we introduce an evaluation framework that compares direct generation, reasoning-guided generation, and de-contextualized generation, which conditions only on the refined prompt. Across eight open-source unified models, we observe a consistent Reasoning Paradox: reasoning traces generally improve performance over direct generation, yet retaining intermediate thoughts as conditioning context often hinders visual synthesis, and conditioning only on the refined prompt yields substantial gains. Our analysis suggests that the bottleneck lies in contextual interference rather than insufficient reasoning capacity. UReason provides a principled testbed for studying reasoning in unified models and motivates future methods that effectively integrate reasoning for visual generation while mitigating interference.
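The three evaluation settings differ only in what text the image generator is conditioned on. The following is a minimal sketch of that distinction, not the authors' implementation: the names `Mode`, `ReasoningOutput`, and `build_conditioning` are hypothetical, and the newline-joined layout used for the reasoning-guided setting is an assumption, since the abstract does not specify how the trace and refined prompt are combined.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    """Hypothetical labels for the three settings named in the abstract."""
    DIRECT = "direct"
    REASONING_GUIDED = "reasoning_guided"
    DECONTEXTUALIZED = "decontextualized"


@dataclass(frozen=True)
class ReasoningOutput:
    """Hypothetical container for what a model's reasoning step produces."""
    trace: str           # intermediate thoughts (chain-of-thought)
    refined_prompt: str  # final prompt distilled from the reasoning


def build_conditioning(
    mode: Mode,
    instruction: str,
    reasoning: Optional[ReasoningOutput] = None,
) -> str:
    """Return the text context that the image generator would see."""
    if mode is Mode.DIRECT:
        # Direct generation: the original instruction only, no reasoning.
        return instruction
    if reasoning is None:
        raise ValueError(f"mode {mode.value!r} requires a reasoning output")
    if mode is Mode.REASONING_GUIDED:
        # Assumed layout: the intermediate thoughts stay in context
        # alongside the instruction and the refined prompt.
        return "\n".join([instruction, reasoning.trace, reasoning.refined_prompt])
    if mode is Mode.DECONTEXTUALIZED:
        # De-contextualized generation: the refined prompt alone.
        return reasoning.refined_prompt
    raise ValueError(f"unknown mode: {mode!r}")
```

Under this sketch, the reasoning-guided and de-contextualized settings share the same reasoning output and differ only in whether the trace remains in the conditioning context, which is the comparison the Reasoning Paradox rests on.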
