MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems
Zifeng Zhu, Mengzhao Jia, Zhihan Zhang, Lang Li, Meng Jiang
TL;DR
MultiChartQA addresses a gap in evaluating vision-language models on real-world multi-chart reasoning. It assembles a large collection of semantically coherent chart sets from public sources and defines four question types: direct, parallel, comparative, and sequential. The benchmark evaluates 20 MLLMs and reveals substantial gaps to human performance. Chain-of-thought prompting yields notable gains, and chart-reference cues help models localize the relevant information. Sequential and cross-chart reasoning remain particularly challenging, and merging charts into a single image or omitting chart references can degrade performance. The benchmark is intended to drive progress in multi-chart understanding.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive abilities across various tasks, including visual question answering and chart comprehension, yet existing benchmarks for chart-related tasks fall short of capturing the complexity of real-world multi-chart scenarios. Current benchmarks primarily focus on single-chart tasks, neglecting the multi-hop reasoning required to extract and integrate information from multiple charts, which is essential in practical applications. To fill this gap, we introduce MultiChartQA, a benchmark that evaluates MLLMs' capabilities in four key areas: direct question answering, parallel question answering, comparative reasoning, and sequential reasoning. Our evaluation of a wide range of MLLMs reveals significant performance gaps compared to humans. These results highlight the challenges in multi-chart comprehension and the potential of MultiChartQA to drive advancements in this field. Our code and data are available at https://github.com/Zivenzhu/Multi-chart-QA .
