Knowledge-Aware Reasoning over Multimodal Semi-structured Tables

Suyash Vardhan Mathur, Jainit Sushil Bafna, Kunal Kartik, Harshita Khandelwal, Manish Shrivastava, Vivek Gupta, Mohit Bansal, Dan Roth

TL;DR

This work investigates knowledge-aware reasoning over multimodal semi-structured tables by introducing MMTabQA, a dataset created by replacing textual table entities with representative images from Wikipedia/Wikidata. The authors categorize questions into explicit, answer-mention, and implicit types, and add synthetic visual questions to probe visual understanding and cross-modal reasoning. A suite of baselines and modeling strategies (partial-input, captioning, table-imaged, interleaved, and oracle-replaced approaches) reveals substantial gaps in current vision-language models, particularly in entity disambiguation and visual-attribute reasoning. Closed-source models such as GPT-4o and Gemini-1.5 Flash achieve the best performance but still fall short of the upper bound. The dataset serves as a benchmark for advancing multimodal table reasoning, with practical implications for healthcare, education, and e-commerce, where multimodal tables are prevalent.

Abstract

Existing datasets for tabular question answering typically focus exclusively on text within cells. However, real-world data is inherently multimodal, often blending images such as symbols, faces, icons, patterns, and charts with textual content in tables. With the evolution of AI models capable of multimodal reasoning, it is pertinent to assess their efficacy in handling such structured data. This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data. We explore their ability to reason on tables that integrate both images and text, introducing MMTabQA, a new dataset designed for this purpose. Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs, understanding visual context, and comparing visual content across images. These findings establish our dataset as a robust benchmark for advancing AI's comprehension and capabilities in analyzing multimodal structured data.

Paper Structure

This paper contains 43 sections, 12 figures, and 15 tables.

Figures (12)

  • Figure 1: Multimodal Table comparing iPhone features
  • Figure 2: Dataset Creation Pipeline
  • Figure 3: Clockwise from left - (a): Table about College Football, (b): Table about College Enrollment, (c): Table about 1984 Central American Games, (d): Table about International Football.
  • Figure 4: WikiTableQuestions Dataset Example
  • Figure 5: WikiSQL Dataset Example
  • ...and 7 more figures