ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou
TL;DR
ChartAB evaluates vision-language models on dense chart grounding and cross-chart alignment, using a two-stage inference pipeline and structured JSON evaluation templates. Its ChartX-derived dataset contains 9,000+ instances across nine chart types, supporting systematic grounding of data and attributes and measurement of differences between two charts under controlled perturbations. The benchmark defines data and attribute grounding tasks, dense alignment tasks, robustness analyses, and QA correlations, and introduces metrics such as a key-value alignment score and legend alignment on a 3×3 grid, each with an explicit mathematical formulation. Empirical results reveal persistent weaknesses in current VLMs, especially on complex charts and on color and text attributes. They also show that intermediate grounding substantially improves dense alignment and downstream QA, which gives concrete guidance for improving chart understanding in multimodal systems.
Abstract
Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of VLMs on chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored to each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs' capability to align and compare elements and attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.
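
To make the two-stage idea concrete, the following is a minimal sketch rather than the benchmark's actual implementation. It assumes a hypothetical JSON layout of the form {series: {category: value}} for grounded chart data. The function names (parse_grounding, diff_cells, cell_f1) are illustrative, and the F1 score over changed cells is a stand-in, not the alignment metric defined in the paper. Stage one is represented by two hard-coded model replies; stage two compares the grounded structures.

```python
import json


def parse_grounding(raw: str) -> dict:
    """Parse a (hypothetical) stage-one JSON reply into {series: {category: value}}."""
    return json.loads(raw)


def diff_cells(chart_a: dict, chart_b: dict) -> set:
    """Return the (series, category) keys whose values differ between two grounded charts."""
    changed = set()
    for series in chart_a.keys() | chart_b.keys():
        row_a = chart_a.get(series, {})
        row_b = chart_b.get(series, {})
        for category in row_a.keys() | row_b.keys():
            # A cell missing from one chart also counts as a difference.
            if row_a.get(category) != row_b.get(category):
                changed.add((series, category))
    return changed


def cell_f1(predicted: set, truth: set) -> float:
    """F1 between predicted and true changed cells (an assumed stand-in metric)."""
    if not predicted and not truth:
        return 1.0
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical stage-one outputs for two charts that differ in a single cell.
reply_a = '{"Sales": {"Q1": 10, "Q2": 20}, "Costs": {"Q1": 5, "Q2": 7}}'
reply_b = '{"Sales": {"Q1": 10, "Q2": 25}, "Costs": {"Q1": 5, "Q2": 7}}'

# Stage two: compare the grounded structures instead of the raw images.
predicted = diff_cells(parse_grounding(reply_a), parse_grounding(reply_b))
truth = {("Sales", "Q2")}
score = cell_f1(predicted, truth)
```

Comparing structured intermediate outputs, instead of asking a model to spot differences directly from two images, is what lets each grounding task be scored with a metric suited to its fields.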
