Rethinking Comprehensive Benchmark for Chart Understanding: A Perspective from Scientific Literature
Lingdong Shen, Qigqi, Kun Ding, Gaofeng Meng, Shiming Xiang
TL;DR
This work revisits chart understanding evaluation by introducing SCI-CQA, a benchmark grounded in scientific literature that emphasizes real, varied chart types (including flowcharts) and context-rich reasoning. It assembles 37,607 high-quality chart-caption-context triplets from 15 conferences and 5,629 exam-style questions to test perception and complex reasoning, with an automated annotation pipeline to scale data efficiently. The study demonstrates that existing models struggle with flowcharts and context-dependent questions, while context and chart-type diversity substantially improve assessment realism. By comparing open-source and proprietary models, and by incorporating human-like exam formats, SCI-CQA provides a rigorous, scalable framework for advancing multimodal chart understanding with practical scientific relevance.
Abstract
Charts in scientific literature often contain complex visual elements, including multi-plot figures, flowcharts, and structural diagrams. Evaluating multimodal models on these authentic and intricate charts provides a more accurate assessment of their understanding abilities. However, existing benchmarks face limitations: a narrow range of chart types, overly simplistic template-based questions and visual elements, and inadequate evaluation methods. These shortcomings lead to inflated performance scores that fail to hold up when models encounter real-world scientific charts. To address these challenges, we introduce a new benchmark, Scientific Chart QA (SCI-CQA), which emphasizes flowcharts as a critical yet often overlooked category. To overcome the limitations of chart variety and simplistic visual elements, we curated a dataset of 202,760 image-text pairs from papers published at 15 top-tier computer science conferences over the past decade. After rigorous filtering, we refined this to 37,607 high-quality charts with contextual information. SCI-CQA also introduces a novel evaluation framework inspired by human exams, encompassing 5,629 carefully curated questions, both objective and open-ended. Additionally, we propose an efficient annotation pipeline that significantly reduces data annotation costs. Finally, we explore context-based chart understanding, highlighting the crucial role of contextual information in solving previously unanswerable questions.
