ChartAB: A Benchmark for Chart Grounding & Dense Alignment

Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou

TL;DR

ChartAB tackles the problem of evaluating vision–language models on dense chart grounding and cross‑chart alignment by introducing a two‑stage inference pipeline and structured JSON evaluation templates. It builds a ChartX‑derived dataset with 9,000+ instances across nine chart types to systematically ground data and attributes and to measure two‑chart differences under controlled perturbations. The benchmark defines data and attribute grounding tasks, dense alignment tasks, robustness analyses, and QA correlations, providing novel metrics such as a key‑value alignment score and legend alignment in a 3×3 grid, all expressed with clear mathematical formulations. Empirical results reveal persistent weaknesses in current VLMs, especially on complex charts and color/text attributes, while demonstrating that intermediate grounding substantially improves dense alignment and downstream QA, offering concrete guidance for improving chart understanding in multimodal systems.

Abstract

Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs' capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.

Paper Structure

This paper contains 29 sections, 5 equations, 17 figures, 3 tables, 3 algorithms.

Figures (17)

  • Figure 1: Examples of paired charts for ChartAB tasks. ChartAB evaluates dense grounding and alignment capabilities of VLMs on chart images. (1) Paired charts in each Data Grounding & Alignment task differ in a few visualized data values. (2) Paired charts in each Attribute Grounding & Alignment task differ in a visualization attribute, e.g., color, legend position, or text style. (3) Each Robustness task contains multiple variants of the same chart-pair for Data Alignment, with different attributes (e.g., colors) across the variants.
  • Figure 2: Statistics of ChartAB. ChartAB includes 9,000+ instances curated for the tasks below: (1) Paired charts for Data Grounding & Alignment differ in one to three data cells; (2) Paired charts for Attribute Grounding & Alignment differ in color, legend position, or text style; (3) The Robustness task includes multiple pairs that share identical differences in data but differ in certain attributes.
  • Figure 3: Two-stage color alignment by o4-mini. The o4-mini model automatically decomposes the task into a grounding step for the colors in each chart, followed by an output prediction of the alignment. This two-stage reasoning yields a more accurate result than GPT-4o, which performs alignment directly without intermediate grounding.
  • Figure 4: Two-Stage Evaluation Pipeline for Data Grounding & Alignment in ChartAB. The first stage focuses on grounding the data visualized by each chart in a CSV table, while the second stage focuses on alignment, which aims to locate the differences between the two tables and output a JSON file listing the differing cells. The other two categories of tasks in ChartAB adopt similar multi-stage pipelines, detailed in the Appendix figures on the color, text-style, and legend alignment inference pipelines.
  • Figure 5: Left: Comparing VLMs on Data Alignment tasks on paired charts with a one-cell difference. Llava-1.6 performs worse than most other VLMs, while QWEN-2.5-VL outperforms GPT-4o on most chart types. Right: Color alignment on fine-grained visual elements (e.g., bars, lines, sectors) between two charts. Most VLMs perform better on simpler and more common charts, e.g., line/bar charts. See the related discussion under Finding 1.
  • ...and 12 more figures
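
The alignment stage described in the Figure 4 caption (comparing two grounded CSV tables and listing the differing cells as JSON) can be illustrated with a minimal Python sketch. This is not the benchmark's actual code or template: the function names `parse_csv_table` and `diff_cells`, the example tables, and the output keys (`row`, `column`, `chart_1`, `chart_2`) are all hypothetical, chosen only to show the idea. In the benchmark itself, a VLM produces both the tables and the difference list.

```python
import csv
import io
import json


def parse_csv_table(text):
    """Parse a CSV string into {row_label: {column_name: value}}.

    The first column is treated as the row label (an assumption of this sketch).
    """
    reader = csv.reader(io.StringIO(text.strip()))
    header = next(reader)
    table = {}
    for row in reader:
        if not row:
            continue
        table[row[0]] = dict(zip(header[1:], row[1:]))
    return table


def diff_cells(table_a, table_b):
    """List the cells whose values differ between two grounded tables."""
    diffs = []
    for row_label, cells_a in table_a.items():
        cells_b = table_b.get(row_label, {})
        for column, value_a in cells_a.items():
            value_b = cells_b.get(column)
            if value_a != value_b:
                # Hypothetical output schema, not the paper's JSON template.
                diffs.append({
                    "row": row_label,
                    "column": column,
                    "chart_1": value_a,
                    "chart_2": value_b,
                })
    return diffs


# Made-up stage-one outputs: one CSV table per chart, differing in one cell.
chart_1_csv = """
Year,Sales,Profit
2020,10,3
2021,12,4
"""
chart_2_csv = """
Year,Sales,Profit
2020,10,3
2021,15,4
"""

differences = diff_cells(parse_csv_table(chart_1_csv), parse_csv_table(chart_2_csv))
print(json.dumps(differences, indent=2))
```

Keeping the difference list as structured JSON rather than free text makes it straightforward to score a prediction cell by cell against a reference list, which is the motivation the abstract gives for using a JSON template.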