CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen

TL;DR

CharXiv targets realistic chart understanding by assembling 2,323 real-world arXiv charts with manually crafted descriptive and reasoning questions. It reveals a sizable gap between the top proprietary model (GPT-4o) and open-source alternatives (e.g., InternVL), and a larger gap to human performance, underscoring weaknesses in current MLLMs for chart reasoning. The benchmark’s careful chart curation, real-world diversity, and human-validated ground truth aim to provide a faithful measure of progress and a stress test against perturbations. This work highlights the need for robust, domain-diverse chart understanding benchmarks to drive real-world capabilities in multimodal language models.

Abstract

Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: https://charxiv.github.io/
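The abstract reports accuracy separately for descriptive and reasoning questions. As a rough illustration of that split, below is a minimal Python sketch that computes per-type accuracy from a question file and a prediction file. The file names and record fields (`id`, `type`, `answer`) are hypothetical, not CharXiv's actual schema, and the exact-match comparison is a simplification; the official evaluation suite may use more flexible answer matching.

```python
import json
from collections import defaultdict

def score_by_question_type(questions_path: str, predictions_path: str) -> dict:
    """Compute accuracy separately for descriptive and reasoning questions.

    Assumes each question record carries an 'id', a 'type'
    ('descriptive' or 'reasoning'), and a gold 'answer', and that
    predictions map question ids to model answer strings. These field
    names are illustrative, not CharXiv's actual schema.
    """
    with open(questions_path) as f:
        questions = json.load(f)
    with open(predictions_path) as f:
        predictions = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        qtype = q["type"]
        total[qtype] += 1
        # Exact match after normalization; a simplification of the
        # answer matching a real grader would apply.
        pred = predictions.get(q["id"], "").strip().lower()
        if pred == q["answer"].strip().lower():
            correct[qtype] += 1

    return {t: correct[t] / total[t] for t in total}

# Example (hypothetical file names):
# accuracies = score_by_question_type("charxiv_val.json", "model_preds.json")
# -> {"descriptive": 0.84, "reasoning": 0.47}
```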

Paper Structure

This paper contains 124 sections, 3 equations, 10 figures, and 15 tables.

Figures (10)

  • Figure 1: Example chart (left), descriptive questions (top-right), and reasoning questions (bottom-right) in CharXiv, where open-source models fail even on basic descriptive questions. Moreover, all models struggle to answer the reasoning question correctly.
  • Figure 2: Model performance comparison on reasoning questions from CharXiv vs. questions from existing benchmarks. As indicated by the red and blue bars respectively, many open-source models surpass proprietary model performance on the 174 sample questions from existing benchmarks (subsets of DVQA, FigureQA, and ChartQA from the testmini split of MathVista) yet fail consistently on the 1,000 reasoning questions from the validation split of CharXiv.
  • Figure 3: Open-source models generalize poorly to modified examples (measured by accuracy). Left: original set against modified-question set. Right: original set against modified-chart set.
  • Figure 4: Metadata breakdown of charts, descriptive questions, and reasoning questions in CharXiv.
  • Figure 5: Analysis on unanswerable questions (a) and charts with subplots (b).
  • ...and 5 more figures