MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark

Bin Shan, Xiang Fei, Wei Shi, An-Lan Wang, Guozhi Tang, Lei Liao, Jingqun Tang, Xiang Bai, Can Huang

TL;DR

A Multimodal benchmark towards Text-rich visual scenes, to evaluate the Cognitive capabilities of MLLMs through visual reasoning and content-creation tasks (MCTBench), and an automatic evaluation pipeline to improve the efficiency and fairness of content-creation evaluation.

Abstract

The comprehension of text-rich visual scenes has become a focal point for evaluating Multi-modal Large Language Models (MLLMs) due to their widespread applications. Current benchmarks tailored to this scenario emphasize perceptual capabilities while overlooking the assessment of cognitive abilities. To address this limitation, we introduce a Multimodal benchmark towards Text-rich visual scenes, to evaluate the Cognitive capabilities of MLLMs through visual reasoning and content-creation tasks (MCTBench). To mitigate potential evaluation bias from the varying distributions of datasets, MCTBench incorporates several perception tasks (e.g., scene text recognition) to ensure a consistent comparison of both the cognitive and perceptual capabilities of MLLMs. To improve the efficiency and fairness of content-creation evaluation, we construct an automatic evaluation pipeline. Evaluations of various MLLMs on MCTBench reveal that, despite their impressive perceptual capabilities, their cognitive abilities require enhancement. We hope MCTBench will offer the community an efficient resource to explore and enhance cognitive capabilities towards text-rich visual scenes.

Paper Structure

This paper contains 28 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison between previous benchmarks [Singh_2019_CVPR, liu2024hidden, li2024seedbench2plus] and our proposed MCTBench. Q and GT stand for question and ground truth.
  • Figure 2: The pipeline of constructing MCTBench.
  • Figure 3: Word-cloud visualization of the questions for the three different tasks. The size of a word indicates how frequently it appears. Best viewed in color.
  • Figure 4: Example predictions from different MLLMs, divided into four groups: GPT-4V [Achiam2023GPT4TR]; Mini-Gemini (MGM) [li2024minigemini] and LLaVA-NeXT [liu2024llavanext] as larger models; Monkey [li2023monkey] and mPLUG-DocOwl [ye2023mplugdocowl] as text-enhanced MLLMs; and LLaVA-1.5 [liu2023improvedllava] and ShareGPT4V [chen2023sharegpt4v] as general MLLMs.
  • Figure 5: Example predictions on content-creation tasks from three representative MLLMs: GPT-4V [Achiam2023GPT4TR], Mini-Gemini [li2024minigemini], and Monkey [li2023monkey]. We mark high-quality sentences in red, underline words that match text in the image, and rate the quality of each generation with stars.