UmniBench: Unified Understand and Generation Model Oriented Omni-dimensional Benchmark

Kai Liu, Leyang Chen, Wenbo Li, Zhikai Chen, Zhixin Wang, Renjing Pei, Linghe Kong, Yulun Zhang

TL;DR

UmniBench tackles the evaluation gap for unified multimodal models by introducing an omni-dimensional benchmark that jointly measures understanding, generation, and editing in a self-generated evaluation loop. The construction combines 13 domains, 15 concepts per domain, and 3 cases per concept, with a four-stage pipeline and a dedicated construction process to ensure high-quality, data-leakage-resistant assessments. Experimental results show competitive unified-model performance, clear degradation across stages, and strong alignment with human judgments, while also enabling single-ability analyses through controlled substitutions. The benchmark is designed to be self-contained, scalable, and comprehensive, aiming to guide and accelerate progress in unified multimodal modeling.

Abstract

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. However, evaluations of unified multimodal models (UMMs) remain decoupled, assessing understanding and generation abilities separately, each on its own datasets. To address this, we propose UmniBench, a benchmark tailored for UMMs with omni-dimensional evaluation. First, UmniBench assesses understanding, generation, and editing abilities within a single evaluation process. Based on human-examined prompts and QA pairs, UmniBench uses the UMM's own understanding ability to evaluate its generation and editing abilities. This simple but effective paradigm allows comprehensive evaluation of UMMs. Second, UmniBench covers 13 major domains and more than 200 concepts, ensuring a thorough inspection of UMMs. Moreover, UmniBench can also decouple understanding, generation, and editing and evaluate each ability separately, providing a fine-grained assessment. Based on UmniBench, we benchmark 24 popular models, including both UMMs and single-ability large models. We hope this benchmark provides a more comprehensive and objective view of unified models and practical support for the community in improving model performance.

Paper Structure

This paper contains 15 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Advantages of our proposed UmniBench compared with previous isolated evaluation protocols for UMMs.
  • Figure 2: The overview of UmniBench. All 13 domains in UmniBench are listed in the table above; the left-hand and right-hand panels show representative images generated for concepts within each domain.
  • Figure 3: The evaluation pipeline. There are three stages: generation, interaction, and counterfactual. Each stage has two parts, a generation part and an understanding part. The UMM takes the prompt and the image from the previous stage (if one exists) as input and generates a new image. The UMM is then asked questions about the image to evaluate whether the image follows the prompt.
  • Figure 4: Two example cases from UmniBench. The UMMs generate or edit the image from the previous stage according to the specified generation, interaction, and counterfactual prompts. The images are shown in sequence, with the prompts below them. Most models produce a successful image in the generation stage but fail on the counterfactual prompt.
  • Figure 5: Correlation with human evaluation. Each point represents a case in UmniBench; the x-axis shows the UmniBench score and the y-axis shows the human evaluation score.
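
The three-stage loop described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `UnifiedModel` interface, the `Stage` container, the exact-match answer check, and the per-stage score (fraction of questions answered as expected) are all assumptions introduced here, and `EchoModel` is a toy stand-in whose "image" is just the list of prompts applied so far.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


class UnifiedModel(Protocol):
    """Hypothetical interface for a UMM; not the API of any real model."""

    def generate(self, prompt: str, image: Optional[object]) -> object: ...
    def answer(self, image: object, question: str) -> str: ...


@dataclass
class Stage:
    name: str                         # "generation", "interaction", or "counterfactual"
    prompt: str                       # instruction used to generate or edit the image
    qa_pairs: list[tuple[str, str]]   # (question, expected answer), assumed non-empty


def evaluate_case(model: UnifiedModel, stages: list[Stage]) -> dict[str, float]:
    """Run one case: each stage edits the previous image, then the same model
    answers questions about its own output (assumed scoring: exact match)."""
    image = None  # the first stage has no input image
    scores: dict[str, float] = {}
    for stage in stages:
        if not stage.qa_pairs:
            raise ValueError(f"stage {stage.name!r} has no QA pairs")
        # Generation part: produce a new image from the prompt and prior image.
        image = model.generate(stage.prompt, image)
        # Understanding part: question the model about the image it just made.
        correct = sum(
            model.answer(image, question).strip().lower() == expected.strip().lower()
            for question, expected in stage.qa_pairs
        )
        scores[stage.name] = correct / len(stage.qa_pairs)
    return scores


class EchoModel:
    """Toy stand-in: the 'image' is the list of prompts applied so far."""

    def generate(self, prompt, image):
        return (image or []) + [prompt]

    def answer(self, image, question):
        # Says "yes" if the queried word occurs in any prompt applied so far.
        return "yes" if any(question in p for p in image) else "no"


if __name__ == "__main__":
    demo = [
        Stage("generation", "a red apple on a table", [("apple", "yes")]),
        Stage("interaction", "a hand picks up the apple", [("hand", "yes")]),
        Stage("counterfactual", "the apple floats upward", [("sinks", "yes")]),
    ]
    print(evaluate_case(EchoModel(), demo))
```

Because each stage receives the image produced by the previous one, an early failure carries forward, which is one plausible reading of the degradation across stages reported in the TL;DR. Replacing either `generate` or `answer` with a separate single-ability model would correspond to the decoupled evaluation the abstract mentions.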