Rethinking Comprehensive Benchmark for Chart Understanding: A Perspective from Scientific Literature

Lingdong Shen, Qigqi, Kun Ding, Gaofeng Meng, Shiming Xiang

TL;DR

This work revisits chart understanding evaluation by introducing SCI-CQA, a benchmark grounded in scientific literature that emphasizes real, varied chart types (including flowcharts) and context-rich reasoning. It assembles 37,607 high-quality chart-caption-context triplets from 15 conferences, along with 5,629 exam-style questions that test perception and complex reasoning, and uses an automated annotation pipeline to scale data efficiently. The study demonstrates that existing models struggle with flowcharts and context-dependent questions, while context and chart-type diversity substantially improve assessment realism. By comparing open-source and proprietary models, and by incorporating human-like exam formats, SCI-CQA provides a rigorous, scalable framework for advancing multimodal chart understanding with practical scientific relevance.

Abstract

Charts in scientific literature often contain complex visual elements, including multi-plot figures, flowcharts, and structural diagrams. Evaluating multimodal models on these authentic and intricate charts provides a more accurate assessment of their understanding abilities. However, existing benchmarks face limitations: a narrow range of chart types, overly simplistic template-based questions and visual elements, and inadequate evaluation methods. These shortcomings lead to inflated performance scores that fail to hold up when models encounter real-world scientific charts. To address these challenges, we introduce a new benchmark, Scientific Chart QA (SCI-CQA), which emphasizes flowcharts as a critical yet often overlooked category. To overcome the limitations of chart variety and simplistic visual elements, we curated a dataset of 202,760 image-text pairs from papers published at 15 top-tier computer science conferences over the past decade. After rigorous filtering, we refined this to 37,607 high-quality charts with contextual information. SCI-CQA also introduces a novel evaluation framework inspired by human exams, encompassing 5,629 carefully curated questions, both objective and open-ended. Additionally, we propose an efficient annotation pipeline that significantly reduces data annotation costs. Finally, we explore context-based chart understanding, highlighting the crucial role of contextual information in solving previously unanswerable questions.
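
To put the filtering step in perspective, the counts quoted in the abstract imply that roughly 18.5% of the collected image-text pairs survive as high-quality charts. The short sketch below only redoes that arithmetic; the constant and function names are illustrative assumptions and are not taken from the paper or its code.

```python
# Counts quoted in the abstract; the names here are hypothetical, chosen for illustration.
RAW_PAIRS = 202_760        # image-text pairs collected from 15 conferences
FILTERED_CHARTS = 37_607   # high-quality charts kept after filtering


def retention_rate(kept: int, total: int) -> float:
    """Return the fraction of items that survive a filtering step."""
    if total <= 0:
        raise ValueError("total must be positive")
    return kept / total


rate = retention_rate(FILTERED_CHARTS, RAW_PAIRS)
print(f"Retained {rate:.1%} of the raw image-text pairs")
```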

Paper Structure

This paper contains 21 sections, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Comparison of SCI-CQA with previous work. The black box represents prior work, the green box highlights the SCI-CQA flowchart section, and the orange box represents the SCI-CQA data chart section. This comparison underscores SCI-CQA's advancements in data diversity and question complexity.
  • Figure 2: SCI-CQA data processing pipeline. Multiple proprietary models are employed to monitor and enhance data quality, supplemented by rigorous manual verification at key stages, which led to the creation of a high-quality evaluation dataset.
  • Figure 3: Data statistics (left) and automated annotation tool pipeline (right). High-quality annotated data and scoring data from model evaluations are used to train the open-source model, which iteratively generates additional pseudo-labeled data.
  • Figure 4: Impact of refined context and captions on answering previously unanswerable questions.
  • Figure 5: Generate question-and-answer pairs based solely on the provided flowchart image.
  • ...and 11 more figures