MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems

Zifeng Zhu, Mengzhao Jia, Zhihan Zhang, Lang Li, Meng Jiang

TL;DR

MultiChartQA tackles the gap in evaluating vision-language systems on real-world multi-chart reasoning by assembling a large, semantically coherent collection of charts from public sources and defining four reasoning tasks: direct, parallel, comparative, and sequential. The benchmark evaluates 20 MLLMs and reveals substantial gaps to human performance, with chain-of-thought prompting providing notable gains and chart-reference cues aiding information localization. Findings show sequential and cross-chart reasoning remain particularly challenging, and merging charts or omitting references can degrade performance. This work establishes a targeted, domain-specific benchmark to drive advancements in multi-chart understanding for future research and applications.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive abilities across various tasks, including visual question answering and chart comprehension, yet existing benchmarks for chart-related tasks fall short in capturing the complexity of real-world multi-chart scenarios. Current benchmarks primarily focus on single-chart tasks, neglecting the multi-hop reasoning required to extract and integrate information from multiple charts, which is essential in practical applications. To fill this gap, we introduce MultiChartQA, a benchmark that evaluates MLLMs' capabilities in four key areas: direct question answering, parallel question answering, comparative reasoning, and sequential reasoning. Our evaluation of a wide range of MLLMs reveals significant performance gaps compared to humans. These results highlight the challenges in multi-chart comprehension and the potential of MultiChartQA to drive advancements in this field. Our code and data are available at https://github.com/Zivenzhu/Multi-chart-QA

Paper Structure

This paper contains 33 sections, 10 figures, 9 tables.

Figures (10)

  • Figure 1: An illustration of a multi-chart question that asks for a comparison between the performance of two optimizers under specific conditions. The model must perform multi-hop reasoning across different charts to arrive at the correct answer, a scenario frequently encountered in real-world applications.
  • Figure 2: MultiChartQA contains four types of QA tasks, covering four crucial abilities for understanding and reasoning across multiple charts. We highlight the key information location for answering each question with boxes and circles. The arrows represent the multi-step reasoning process across different charts.
  • Figure 3: Detailed illustration of question categories. MultiChartQA features four distinct types of questions, varying in form, content, and difficulty. For brevity, the category names are abbreviated. Struct.: Structure, Comp.: Comparative, and Seq.: Sequential.
  • Figure 4: Accuracy of the 18 MLLMs under three settings: the original setting, merged charts, and without Chain-of-Thought (CoT) reasoning. Most models decline in performance when processing merged charts or when answering without CoT reasoning.
  • Figure 5: An example of error analysis for a chart set, with the corresponding question, the Claude-3.5-Sonnet output, and the correct answer shown on the right.
  • ...and 5 more figures