EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts

Kushin Mukherjee, Donghao Ren, Dominik Moritz, Yannick Assogba

TL;DR

EncQA is a benchmark that isolates visual reasoning over chart encodings by systematically varying six visual encoding channels and eight analytic tasks. Its synthetic data pipeline yields 2,076 question-answer pairs, which are used to evaluate nine vision-language models (VLMs). The evaluation reveals strong encoding- and task-dependent variation, and for many task-encoding pairs accuracy does not improve with model size. Simple, directly readable visual mappings are easier for VLMs, while legend-dependent encodings, correlation judgments, and anomaly detection expose persistent gaps in visual reasoning. The work argues for targeted, encoding-aware improvements rather than relying solely on model scale, and it provides a framework for diagnosing specific perceptual weaknesses in chart understanding.

Abstract

Multimodal vision-language models (VLMs) continue to achieve ever-improving scores on chart understanding benchmarks. Yet, we find that this progress does not fully capture the breadth of visual reasoning capabilities essential for interpreting charts. We introduce EncQA, a novel benchmark informed by the visualization literature, designed to provide systematic coverage of visual encodings and analytic tasks that are crucial for chart understanding. EncQA provides 2,076 synthetic question-answer pairs, enabling balanced coverage of six visual encoding channels (position, length, area, color quantitative, color nominal, and shape) and eight tasks (find extrema, retrieve value, find anomaly, filter values, compute derived value exact, compute derived value relative, correlate values, and correlate values relative). Our evaluation of 9 state-of-the-art VLMs reveals that performance varies significantly across encodings within the same task, as well as across tasks. Contrary to expectations, we observe that performance does not improve with model size for many task-encoding pairs. Our results suggest that advancing chart understanding requires targeted strategies addressing specific visual reasoning gaps, rather than solely scaling up model or dataset size.

Paper Structure

This paper contains 28 sections, 1 equation, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Example Question-Answer pair from ChartQA where the answer could be directly extracted from the text without making a visual judgment of the length of the bar with respect to the axis.
  • Figure 2: Model accuracy per visual encoding for each task. (R) indicates 'Relative' variants of the task. Red lines mark chance-level performance for multiple-choice questions. Error bars indicate 95% confidence intervals computed via the bootstrap method.
  • Figure 3: Effect of correlation strength on accuracy for the Correlate Values task.
  • Figure 4: Effect of outlier type on accuracy for the Find Anomaly task.
  • Figure 5: Symmetric Mean Absolute Percentage Error for numeric responses (lower is better).
  • ...and 3 more figures
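
The captions for Figures 2 and 5 name two standard statistics: bootstrap 95% confidence intervals on accuracy and Symmetric Mean Absolute Percentage Error (sMAPE) on numeric responses. The sketch below shows one common way to compute each. It is not the authors' code: the function names, the resample count, the percentile-style interval, and this particular sMAPE variant (mean of |p - a| / ((|p| + |a|) / 2), in percent) are all assumptions, and the paper's exact definitions may differ.

```python
import random


def smape(predicted, actual):
    """One common sMAPE variant, in percent (range 0 to 200; lower is better).

    Assumption: pairs where both values are zero contribute zero error.
    """
    terms = []
    for p, a in zip(predicted, actual):
        denom = (abs(p) + abs(a)) / 2
        terms.append(0.0 if denom == 0 else abs(p - a) / denom)
    return 100 * sum(terms) / len(terms)


def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean accuracy.

    `correct` is a list of 0/1 outcomes, one per question.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct)
    # Resample the outcomes with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical outcomes for illustration only, not data from the paper.
    outcomes = [1] * 70 + [0] * 30
    print("accuracy CI:", bootstrap_ci(outcomes))
    print("sMAPE:", smape([110.0, 45.0], [90.0, 50.0]))
```

Because this sMAPE variant is bounded at 200%, a single wildly wrong numeric answer cannot dominate the average the way it would under an unbounded relative error.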