MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning

Prasham Yatinkumar Titiya, Jainil Trivedi, Chitta Baral, Vivek Gupta

TL;DR

MMTBENCH targets complex multimodal table reasoning by pairing real-world tables with embedded images and charts, creating a broad QA challenge. The authors compile 500 tables and 4,021 questions across diverse domains, with explicit, implicit, answer-mention, and visual-based queries and varied reasoning types. They present five baselines to dissect the role of vision and structured data and report that current LLM/VLM systems struggle with visual grounding and multi-step inference. The benchmark aims to drive progress toward practical systems capable of reasoning over intertwined textual and visual table data in real-world settings.

Abstract

Multimodal tables, those that integrate semi-structured data with visual elements such as charts and maps, are ubiquitous across real-world domains, yet they pose a formidable challenge to current vision-language models (VLMs). While large language models (LLMs) and VLMs have demonstrated strong capabilities in text and image understanding, their performance on complex, real-world multimodal table reasoning remains unexplored. To bridge this gap, we introduce MMTBENCH (Multimodal Table Benchmark), a benchmark consisting of 500 real-world multimodal tables drawn from diverse sources, with a total of 4,021 question-answer pairs. MMTBENCH questions cover four question types (Explicit, Implicit, Answer Mention, and Visual Based), five reasoning types (Mathematical, Extrema Identification, Fact Verification, Vision Based, and Others), and eight table types (Single/Multiple Entity, Maps and Charts with Entities, Single/Multiple Charts, Maps, and Visualizations). Extensive evaluation of state-of-the-art models on all types reveals substantial performance gaps, particularly on questions requiring visual-based reasoning and multi-step inference. These findings show the urgent need for improved architectures that more tightly integrate vision and language processing. MMTBENCH offers a challenging, high-quality resource that mirrors the complexity of real-world tasks and can support future research on multimodal tables.
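
As a rough illustration of the taxonomy described in the abstract, the following Python sketch models a single question-answer item and computes accuracy per category. The class and field names (`QAItem`, `table_id`, `gold_answer`), the helper `accuracy_by`, and the case-insensitive exact-match scoring are all assumptions made for this example; they are not the benchmark's actual data format or evaluation metric. Only the category labels are taken from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    # The four question types named in the abstract.
    EXPLICIT = "Explicit"
    IMPLICIT = "Implicit"
    ANSWER_MENTION = "Answer Mention"
    VISUAL_BASED = "Visual Based"


class ReasoningType(Enum):
    # The five reasoning types named in the abstract.
    MATHEMATICAL = "Mathematical"
    EXTREMA_IDENTIFICATION = "Extrema Identification"
    FACT_VERIFICATION = "Fact Verification"
    VISION_BASED = "Vision Based"
    OTHERS = "Others"


@dataclass(frozen=True)
class QAItem:
    # Hypothetical record layout; the real dataset schema may differ.
    table_id: str
    question: str
    gold_answer: str
    question_type: QuestionType
    reasoning_type: ReasoningType


def accuracy_by(items, predictions, key):
    """Group items with `key` and return exact-match accuracy per group.

    Exact match (case-insensitive, whitespace-trimmed) is an assumed
    stand-in for whatever metric the benchmark actually uses.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for item, pred in zip(items, predictions):
        group = key(item)
        totals[group] += 1
        if pred.strip().lower() == item.gold_answer.strip().lower():
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Made-up items purely to exercise the helper.
    items = [
        QAItem("t1", "Which row has the highest value?", "Row A",
               QuestionType.EXPLICIT, ReasoningType.EXTREMA_IDENTIFICATION),
        QAItem("t1", "What colour is the largest bar?", "blue",
               QuestionType.VISUAL_BASED, ReasoningType.VISION_BASED),
    ]
    predictions = ["row a", "red"]
    for group, acc in accuracy_by(items, predictions,
                                  key=lambda it: it.question_type).items():
        print(group.value, acc)
```

Passing a different `key`, such as `lambda it: it.reasoning_type`, breaks the same predictions down by reasoning type instead, which is the kind of per-category comparison the abstract refers to when it reports gaps on visual-based and multi-step questions.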

Paper Structure

This paper contains 28 sections, 12 figures, and 13 tables.

Figures (12)

  • Figure 1: A Multimodal Table in a Financial Context
  • Figure 2: Table Types Distribution
  • Figure 3: Answer Types Distribution
  • Figure 4: Reasoning Types Distribution
  • Figure 5: Question Types Distribution
  • ...and 7 more figures