UmniBench: Unified Understand and Generation Model Oriented Omni-dimensional Benchmark
Kai Liu, Leyang Chen, Wenbo Li, Zhikai Chen, Zhixin Wang, Renjing Pei, Linghe Kong, Yulun Zhang
TL;DR
UmniBench tackles the evaluation gap for unified multimodal models by introducing an omni-dimensional benchmark that jointly measures understanding, generation, and editing in a self-generated evaluation loop. The construction combines 13 domains, 15 concepts per domain, and 3 cases per concept, with a four-stage pipeline and a dedicated construction process to ensure high-quality, data-leakage-resistant assessments. Experimental results show competitive unified-model performance, clear degradation across stages, and strong alignment with human judgments, while also enabling single-ability analyses through controlled substitutions. The benchmark is designed to be self-contained, scalable, and comprehensive, aiming to guide and accelerate progress in unified multimodal modeling.
Abstract
Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. However, evaluations of unified multimodal models (UMMs) remain decoupled, assessing their understanding and generation abilities separately with corresponding datasets. To address this, we propose UmniBench, a benchmark tailored for UMMs with omni-dimensional evaluation. First, UmniBench can assess understanding, generation, and editing abilities within a single evaluation process. Based on human-examined prompts and QA pairs, UmniBench leverages the UMM itself to evaluate its generation and editing abilities with its own understanding ability. This simple but effective paradigm allows comprehensive evaluation of UMMs. Second, UmniBench covers 13 major domains and more than 200 concepts, ensuring a thorough inspection of UMMs. Moreover, UmniBench can also decouple and separately evaluate understanding, generation, and editing abilities, providing a fine-grained assessment. Based on UmniBench, we benchmark 24 popular models, including both UMMs and single-ability large models. We hope this benchmark provides a more comprehensive and objective view of unified models and logistical support for improving the performance of community models.
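The self-evaluation paradigm described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Case` fields, the `generate`/`edit`/`answer` callables, the exact-match scoring rule, and the toy "images" (plain dictionaries) are all hypothetical stand-ins chosen for illustration. The sketch only shows the general shape of the loop, in which one model generates an image, edits it, and answers human-examined QA pairs about both results.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

QAPairs = List[Tuple[str, str]]  # (question, expected answer)


@dataclass
class Case:
    """One hypothetical benchmark case (field names are assumptions)."""
    gen_prompt: str    # text-to-image prompt
    edit_prompt: str   # editing instruction applied to the generated image
    gen_qa: QAPairs    # QA pairs checked against the generated image
    edit_qa: QAPairs   # QA pairs checked against the edited image


def qa_accuracy(answer: Callable[[Any, str], str], image: Any, qa: QAPairs) -> float:
    """Fraction of questions answered correctly (exact match is an assumption)."""
    if not qa:
        return 0.0
    correct = sum(answer(image, q).strip().lower() == a.strip().lower() for q, a in qa)
    return correct / len(qa)


def evaluate_case(
    generate: Callable[[str], Any],
    edit: Callable[[Any, str], Any],
    answer: Callable[[Any, str], str],
    case: Case,
) -> Dict[str, float]:
    """Generate, question, edit, question again; return per-stage scores."""
    image = generate(case.gen_prompt)
    gen_score = qa_accuracy(answer, image, case.gen_qa)
    edited = edit(image, case.edit_prompt)
    edit_score = qa_accuracy(answer, edited, case.edit_qa)
    return {"generation": gen_score, "editing": edit_score}


# Toy stand-ins so the sketch runs without any real model.
def toy_generate(prompt: str) -> Dict[str, str]:
    color, obj = prompt.split()           # e.g. "red cat"
    return {"color": color, "object": obj}


def toy_edit(image: Dict[str, str], instruction: str) -> Dict[str, str]:
    key, value = instruction.split("=")   # e.g. "color=blue"
    return {**image, key: value}


def toy_answer(image: Dict[str, str], question: str) -> str:
    return image.get(question, "unknown")


case = Case(
    gen_prompt="red cat",
    edit_prompt="color=blue",
    gen_qa=[("color", "red"), ("object", "cat")],
    edit_qa=[("color", "blue"), ("object", "cat")],
)
scores = evaluate_case(toy_generate, toy_edit, toy_answer, case)
print(scores)  # {'generation': 1.0, 'editing': 1.0}
```

Because the three abilities are passed in as separate callables, a decoupled evaluation of the kind the abstract mentions could, under these assumptions, be approximated by swapping one callable for a fixed reference model while keeping the other two unchanged.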
