GIR-Bench: Versatile Benchmark for Generating Images with Reasoning

Hongxiang Li, Yaowei Li, Bin Lin, Yuwei Niu, Yuhang Yang, Xiaoshuang Huang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Long Chen

TL;DR

GIR-Bench introduces a reasoning-centric benchmark for unified multimodal systems to evaluate how understanding aligns with generation and editing tasks. It provides three task-specific pipelines (UGC for understanding-generation consistency, T2I for reasoning-based generation, and Edit for multi-step editing), along with explicit metrics that mitigate MLLM-as-a-Judge biases. Empirical results across 21 participants show that unified architectures offer gains in reasoning-driven generation but still struggle to faithfully transfer reasoning into visual content, highlighting a persistent gap between understanding and generation. The benchmark thus enables fine-grained analysis and guides future research toward integrated reasoning and generation in multimodal systems.

Abstract

Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce GIR-Bench, a comprehensive benchmark that evaluates unified models from three complementary perspectives. First, we investigate understanding-generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Second, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Third, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design a task-specific evaluation pipeline. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive ablations over various unified models and generation-only systems show that, although unified models are more capable on reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at https://hkust-longgroup.github.io/GIR-Bench.

Paper Structure

This paper contains 18 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Illustration examples of GIR-Bench, which highlight misalignments between the reasoning and generation capabilities of state-of-the-art unified multimodal models.
  • Figure 2: Examples of leading models on GIR-Bench. The complex and varied tasks pose challenges to current models.
  • Figure 3: Illustration of GIR-Bench-UGC. For each real-world entity, an implicit prompt drives text-to-image generation, while the corresponding real image is used for image understanding evaluation.
  • Figure 4: Performance decline from category inputs to implicit prompts.
  • Figure 5: Qualitative cases in GIR-Bench-UGC, showing both direct category inputs and implicit prompt inputs.
  • ...and 7 more figures