UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Zimo Wen; Boxiu Li; Wanbo Zhang; Junxiang Lei; Xiaoyu Chen; Yijia Fan; Qi Zhang; Yujiang Wang; Lili Qiu; Bo Li; Ziwei Liu; Caihua Shan; Yifan Yang; Yifei Shen

TL;DR

UniG2U-Bench is a comprehensive benchmark that organizes generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, each requiring a different degree of implicit or explicit visual transformation. Its results highlight the need for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.

Abstract

Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.
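The comparison behind the first two findings can be illustrated with a small sketch. The paper's formal G2U definition is not part of this excerpt, so the code below assumes the simplest reading: per-subtask accuracy under Generate-then-Answer (GtA) inference minus per-subtask accuracy under direct inference. The function names, the record format, and the subtask labels ("spatial", "chart") are hypothetical and used only for illustration.

```python
from collections import defaultdict


def accuracy_by_subtask(records):
    """Compute accuracy per subtask from (subtask, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subtask, is_correct in records:
        totals[subtask] += 1
        hits[subtask] += int(is_correct)
    return {s: hits[s] / totals[s] for s in totals}


def gta_minus_direct(direct_records, gta_records):
    """Per-subtask accuracy difference (GtA - direct).

    Assumption: a positive value is read as "generation helped" on that
    subtask; this is an illustrative proxy, not the paper's metric.
    """
    direct = accuracy_by_subtask(direct_records)
    gta = accuracy_by_subtask(gta_records)
    # Only compare subtasks that were evaluated in both modes.
    return {s: gta[s] - direct[s] for s in direct if s in gta}


# Hypothetical toy results, not data from the paper.
direct_demo = [("spatial", True), ("spatial", False),
               ("chart", True), ("chart", True)]
gta_demo = [("spatial", True), ("spatial", True),
            ("chart", True), ("chart", False)]

deltas = gta_minus_direct(direct_demo, gta_demo)
for subtask, delta in sorted(deltas.items()):
    print(f"{subtask}: {delta:+.2f}")
```

In this toy data the hypothetical "spatial" subtask improves under GtA while "chart" degrades, mirroring the kind of task-dependent pattern the abstract describes.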

Paper Structure (161 sections, 10 equations, 48 figures, 14 tables)

Introduction
Related Works
Unified Multimodal Models
Benchmarks for Unified Models
Formulations and Definitions
Definitions: UMMs and G2U
Task setting.
Pure understanding models.
Unified multimodal models (UMMs).
Defining "Generation Helps Understanding" (G2U).
Taxonomy of Unified Multimodal Models
(1) End-to-end unified models (E2E).
(2) Decoupled unified systems (Decoupled).
(3) Agentic unified models (UM-Ag).
Direct vs. Generate-then-Answer (GtA) Inference
...and 146 more sections

Figures (48)

Figure 1: Model Performance Radar Chart
Figure 2: Taxonomy of unified multimodal models (UMMs). All models annotated in the figure are benchmarked in this work.
Figure 3: Task taxonomy overview of UniG2U.
Figure 4: Task selection and coverage in UniG2U.
Figure 5: Overview of our two evaluation prompt formats: Direct and Generate-then-Answer (GtA).
...and 43 more figures
