ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering

Jingxuan Wei, Nan Xu, Junnan Zhu, Yanni Hao, Gaowei Wu, Bihui Yu, Lei Wang

TL;DR

ChartMind introduces the first benchmark for complex, real-world chart question answering with multilingual support and open-ended outputs. The authors propose ChartLLM, a context-driven framework that extracts key chart elements (titles, legends, axes) to guide reasoning, improving robustness across diverse chart types and languages. Across 14 multimodal models and seven task categories, ChartLLM outperforms instruction-following, OCR-enhanced, and chain-of-thought baselines, with strong alignment between GPT-4o scores and human judgments. The work highlights the importance of structured chart context for high-level reasoning and points to future directions such as multi-turn dialogues and cross-chart reasoning to better mirror real-world chart analysis.

Abstract

Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. While early approaches have shown promising performance by focusing on visual features or leveraging large-scale pre-training, most existing evaluations rely on rigid output formats and objective metrics, thus ignoring the complex, real-world demands of practical chart analysis. In this paper, we introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. ChartMind covers seven task categories, incorporates multilingual contexts, supports open-domain textual outputs, and accommodates diverse chart formats, bridging the gap between real-world applications and traditional academic benchmarks. Furthermore, we propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements, reducing noise, and enhancing the reasoning accuracy of multimodal large language models. Extensive evaluations on ChartMind and three representative public benchmarks with 14 mainstream multimodal models show our framework significantly outperforms the previous three common CQA paradigms: instruction-following, OCR-enhanced, and chain-of-thought, highlighting the importance of flexible chart understanding for real-world CQA. These findings suggest new directions for developing more robust chart reasoning in future research.
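
The context-driven idea summarized above (extract titles, legends, and axes, then use them to guide reasoning) can be illustrated with a minimal sketch. This is not the authors' ChartLLM implementation, which this page does not describe; `ChartContext` and `build_context_prompt` are hypothetical names, and the example assumes the chart elements have already been extracted by some upstream step.

```python
from dataclasses import dataclass, field


@dataclass
class ChartContext:
    """Hypothetical container for the chart elements named in the TL;DR."""
    title: str = ""
    legend: list[str] = field(default_factory=list)
    x_axis: str = ""
    y_axis: str = ""


def build_context_prompt(context: ChartContext, question: str) -> str:
    """Assemble a text prompt that places extracted context before the question.

    Assumption: the prompt layout is illustrative only; the paper's actual
    prompt format is not given in this summary.
    """
    lines = ["Chart context:"]
    # Only include elements that were actually extracted, to avoid adding noise.
    if context.title:
        lines.append(f"- Title: {context.title}")
    if context.legend:
        lines.append(f"- Legend: {', '.join(context.legend)}")
    if context.x_axis:
        lines.append(f"- X axis: {context.x_axis}")
    if context.y_axis:
        lines.append(f"- Y axis: {context.y_axis}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

The resulting string would be passed to a multimodal model alongside the chart image; skipping empty fields reflects the abstract's emphasis on reducing noise.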

Paper Structure

This paper contains 36 sections, 4 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Key Challenges in CQA Benchmarks: (A) Predominantly monolingual, limiting multilingual applicability in chart question answering; (B) Fixed formats and metrics, restricting adaptability to diverse charts; (C) Emphasis on deterministic answers, overlooking complex reasoning such as trend analysis and summarization.
  • Figure 2: Data Construction Pipeline for ChartMind.
  • Figure 3: Language and task distribution in ChartMind.
  • Figure 4: Topic distribution in ChartMind.
  • Figure 5: Performance of multimodal models across Chinese and English datasets in ChartMind.
  • ...and 5 more figures