Benchmarking Visual Language Models on Standardized Visualization Literacy Tests

Saugat Pandey, Alvitta Ottley

TL;DR

This paper benchmarks four leading Visual Language Models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.2 Vision 11B) on standardized visualization-literacy tests VLAT and CALVI to assess reading, interpretation, and critical thinking about misleading visualizations. Using a randomized trial design with ten runs per model and structured prompting, it reveals that Claude achieves the strongest basic visualization interpretation (VLAT) while all models struggle significantly with detecting misleaders (CALVI). The results show robust performance on simple chart types and hierarchical structures but notable weaknesses on data-dense encodings, spatial reasoning, and deception detection, highlighting gaps in current VLM architectures. The work provides a reproducible evaluation framework and actionable guidance for integrating VLMs into visualization systems with human oversight.

Abstract

The increasing integration of Visual Language Models (VLMs) into visualization systems demands a comprehensive understanding of their visual interpretation capabilities and constraints. While existing research has examined individual models, systematic comparisons of VLMs' visualization literacy remain unexplored. We bridge this gap through a rigorous, first-of-its-kind evaluation of four leading VLMs (GPT-4, Claude, Gemini, and Llama) using standardized assessments: the Visualization Literacy Assessment Test (VLAT) and Critical Thinking Assessment for Literacy in Visualizations (CALVI). Our methodology uniquely combines randomized trials with structured prompting techniques to control for order effects and response variability, a critical consideration overlooked in many VLM evaluations. Our analysis reveals that while specific models demonstrate competence in basic chart interpretation (Claude achieving 67.9% accuracy on VLAT), all models exhibit substantial difficulties in identifying misleading visualization elements (maximum 30.0% accuracy on CALVI). We uncover distinct performance patterns: strong capabilities in interpreting conventional charts like line charts (76-96% accuracy) and detecting hierarchical structures (80-100% accuracy), but consistent difficulties with data-dense visualizations involving multiple encodings (bubble charts: 18.6-61.4%) and anomaly detection (25-30% accuracy). Significantly, we observe distinct uncertainty management behavior across models, with Gemini displaying heightened caution (22.5% question omission) compared to others (7-8%). These findings provide crucial insights for the visualization community by establishing reliable VLM evaluation benchmarks, identifying areas where current models fall short, and highlighting the need for targeted improvements in VLM architectures for visualization tasks.
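
The randomized-trial idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that randomization means presenting the test questions in an independently shuffled order on each of the ten runs; the function name `randomized_runs`, the seeding scheme, and the placeholder question IDs are hypothetical and are not taken from the paper.

```python
import random


def randomized_runs(question_ids, n_runs=10, base_seed=0):
    """Return one independently shuffled question order per run.

    Assumption: each run gets its own seeded generator so that the
    orders differ between runs but can be reproduced exactly.
    """
    orders = []
    for run in range(n_runs):
        rng = random.Random(base_seed + run)  # per-run seed (hypothetical scheme)
        order = list(question_ids)            # copy so the input is not mutated
        rng.shuffle(order)
        orders.append(order)
    return orders


# Placeholder IDs for illustration only; not the real test items.
example_ids = ["q1", "q2", "q3", "q4", "q5"]
orders = randomized_runs(example_ids)
```

Shuffling per run means that any effect of question position is spread across runs rather than tied to one fixed order, which is one plausible way to control for the order effects the abstract refers to.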

Paper Structure

This paper contains 27 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: VLAT (left) and CALVI (right) prompt templates used for VLM evaluation.
  • Figure 2: Model performance comparison across different visualization types in VLAT. Each plot shows mean accuracy with 95% confidence intervals, comparing VLMs against human performance.
  • Figure 3: Model performance comparison across different visualization tasks in VLAT. Each plot shows mean accuracy with 95% confidence intervals, comparing VLMs against human performance.
  • Figure 4: Comparison of VLM and human performance across different misleader types in the CALVI assessment. Points show mean accuracy with 95% confidence intervals. MS = Manipulation of Scales.
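
The figure captions report mean accuracy with 95% confidence intervals. The listing does not say how those intervals were computed, so the following is only a sketch under one common assumption: a two-sided Student's t interval over per-run accuracies from ten runs. The function name `mean_ci` and the sample accuracies are hypothetical and are not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 9 degrees of freedom,
# i.e. the value appropriate for 10 runs. Pass another value for other n.
T_CRIT_DF9 = 2.262


def mean_ci(accuracies, t_crit=T_CRIT_DF9):
    """Return (mean, lower, upper) for a t-based confidence interval."""
    m = mean(accuracies)
    # Half-width = critical value times the standard error of the mean.
    half_width = t_crit * stdev(accuracies) / sqrt(len(accuracies))
    return m, m - half_width, m + half_width


# Made-up per-run accuracies for illustration only.
runs = [0.50, 0.60, 0.55, 0.45, 0.50, 0.60, 0.55, 0.45, 0.50, 0.50]
m, lower, upper = mean_ci(runs)
print(f"mean={m:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

With only ten runs, a t-based interval is somewhat wider than a normal-approximation interval; the authors may have used a different method, such as bootstrapping.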