EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding
Muye Huang, Han Lai, Xinyu Zhang, Wenjun Wu, Jie Ma, Lingling Zhang, Jun Liu
TL;DR
EvoChart tackles the gap between chart-style training data and real-world chart understanding by introducing a three-stage self-training pipeline that synthesizes high-quality, diverse chart data, along with an accompanying real-world benchmark, EvoChart-QA. The method alternates between compositional chart generation, chart evaluation/refinement, and QA-pair generation/training to produce progressively harder data and a stronger chart-understanding capability. Empirical results show EvoChart achieving 54.2% accuracy on EvoChart-QA (surpassing GPT-4o at 49.8%) and 81.5% on ChartQA, while also revealing that real-world chart understanding remains challenging for all models, especially on Complex Retrieval tasks and non-tabular chart types like Pie and Scatter. The work demonstrates the value of self-training with a refinement loop and provides a practical, multi-source benchmark that better reflects real-world chart understanding, with potential to guide future advances in visual-language chart reasoning.
Abstract
Chart understanding enables automated data analysis for humans, which requires models to achieve highly accurate visual comprehension. While existing Visual Language Models (VLMs) have shown progress in chart understanding, the lack of high-quality training data and comprehensive evaluation benchmarks hinders VLM chart comprehension. In this paper, we introduce EvoChart, a novel self-training method for generating synthetic chart data to enhance VLMs' capabilities in real-world chart comprehension. We also propose EvoChart-QA, a novel benchmark for measuring models' chart comprehension abilities in real-world scenarios. Specifically, EvoChart is a unique self-training data synthesis approach that simultaneously produces a high-quality training corpus and a high-performance chart understanding model. EvoChart-QA consists of 650 distinct real-world charts collected from 140 different websites and 1,250 expert-curated questions that focus on chart understanding. Experimental results on various open-source and proprietary VLMs tested on EvoChart-QA demonstrate that even the best proprietary model, GPT-4o, achieves only 49.8% accuracy. Moreover, the EvoChart method significantly boosts the performance of open-source VLMs on real-world chart understanding tasks, achieving 54.2% accuracy on EvoChart-QA.
