OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding

Jiancong Xie, Wenjin Wang, Zhuomeng Zhang, Zihan Liu, Qi Liu, Ke Feng, Zixun Sun, Yuedong Yang

TL;DR

OIG-Bench targets an underexplored task: evaluating whether multimodal models understand One-Image Guides in a human-like way. It introduces a bilingual 808-image benchmark constructed via a semi-automated multi-agent annotation pipeline, with two evaluation tasks (description generation and VQA) and 29 MLLMs assessed across five capability dimensions. The study finds that Qwen2.5-VL-72B achieves the best overall performance, yet all models struggle with semantic consistency and complex reasoning. The proposed multi-agent annotation system delivers superior captioning quality at the cost of higher latency, which can be mitigated by selectively removing agents. These results expose current limitations and offer a practical pathway for dataset construction and model improvement toward human-like visual understanding in information-dense multimodal contexts.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities. However, their capacity for human-like understanding of One-Image Guides remains insufficiently explored. One-Image Guides are a visual format that combines text, imagery, and symbols to present reorganized and structured information for easier comprehension. They are designed specifically for human viewing and inherently embody the characteristics of human perception and understanding. Here, we present OIG-Bench, a comprehensive benchmark focused on One-Image Guide understanding across diverse domains. To reduce the cost of manual annotation, we developed a semi-automated annotation pipeline in which multiple intelligent agents collaborate to generate preliminary image descriptions, assisting humans in constructing image-text pairs. With OIG-Bench, we have conducted a comprehensive evaluation of 29 state-of-the-art MLLMs, including both proprietary and open-source models. The results show that Qwen2.5-VL-72B performs best among the evaluated models, with an overall accuracy of 77%. Nevertheless, all models exhibit notable weaknesses in semantic understanding and logical reasoning, indicating that current MLLMs still struggle to accurately interpret complex visual-text relationships. We also demonstrate that the proposed multi-agent annotation system outperforms all MLLMs in image captioning, highlighting its potential as both a high-quality image description generator and a valuable tool for future dataset construction. Datasets are available at https://github.com/XiejcSYSU/OIG-Bench.

Paper Structure

This paper contains 27 sections, 1 equation, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Left: An example of a One-Image Guide. Middle: Three major challenges for MLLMs in understanding One-Image Guides. Right: Results of six representative MLLMs and the proposed multi-agent annotation system. The left-skewed distribution indicates that MLLMs still perform poorly in logical reasoning and semantic understanding. In contrast, the multi-agent annotation system achieves the best overall performance.
  • Figure 2: The One-Image Guide Data Processing Pipeline. We collect One-Image Guide images from the Internet and filter them using a cross-examination process with multiple MLLMs, where one model poses questions and others answer. Images with accuracy above 0.8 are considered too simple and removed. Next, we annotate the images using a multi-agent system, followed by manual verification. In the evaluation stage, we assess MLLMs on two tasks, namely description generation and VQA, covering five key capabilities.
  • Figure 3: Statistics of OIG-Bench. (a) Domain distribution: Game (48.8%), Travel (28.7%), Food (9.1%), Medicine (8.9%), and Other. (b) Diagram type distribution: Layout Diagrams (70.5%) and Logic Diagrams (29.5%). (c) Kernel density estimation of image entropy, comparing OIG-Bench with SEED-Bench-2-Plus and InfographicVQA.
  • Figure 4: Performance comparison of two representative MLLMs across different image domains and types.
  • Figure 5: Case analysis of model errors.
  • ...and 6 more figures
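
The difficulty filter described in the Figure 2 caption (one model poses questions, others answer, and images with accuracy above 0.8 are removed) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the representation of each image's cross-examination outcome as a list of booleans, and the decision to keep an image whose accuracy is exactly 0.8 are all hypothetical choices made for the example.

```python
from typing import Dict, List


def cross_exam_accuracy(answers_correct: List[bool]) -> float:
    """Fraction of cross-examination answers judged correct for one image.

    `answers_correct` is an assumed representation: one boolean per
    (question, answering model) pair.
    """
    if not answers_correct:
        raise ValueError("at least one answer is needed to compute accuracy")
    return sum(answers_correct) / len(answers_correct)


def filter_too_easy(
    results: Dict[str, List[bool]], threshold: float = 0.8
) -> List[str]:
    """Return the image ids that survive the difficulty filter.

    Images whose accuracy is strictly above `threshold` are treated as
    too simple and dropped; the 0.8 default follows the Figure 2 caption.
    """
    return [
        image_id
        for image_id, outcomes in results.items()
        if cross_exam_accuracy(outcomes) <= threshold
    ]


if __name__ == "__main__":
    # Hypothetical outcomes for three images, five answers each.
    demo = {
        "guide_a.png": [True, True, True, True, True],
        "guide_b.png": [True, False, False, True, False],
        "guide_c.png": [True, True, True, True, False],
    }
    print(filter_too_easy(demo))
```

In this sketch, an image answered correctly every time is discarded, while images that the answering models get wrong more often are kept for multi-agent annotation and manual verification.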