CHART-6: Human-Centered Evaluation of Data Visualization Understanding in Vision-Language Models

Arnav Verma, Kushin Mukherjee, Christopher Potts, Elisa Kreiss, Judith E. Fan

TL;DR

CHART-6 introduces a human-centered benchmark for data-visualization understanding, comparing eight vision-language models with human participants across six tests. The authors evaluate the models on 851 items (GGR, VLAT, CALVI, HOLF, HOLF-Multi, ChartQA-Human) and analyze response validity, accuracy, and error patterns relative to humans. On average, the models underperform humans, and no model approaches the human noise ceiling, although GPT-4V is often the best-performing model and shows partial alignment with humans in relative strengths. The findings highlight gaps in mechanistic modeling of human visualization reasoning, and the authors propose future directions: unified measures, adaptive testing, and more human-aligned learning signals to advance cognitive benchmarking. The work provides open resources for reproducibility and positions CHART-6 as a platform for tracking progress toward human-like graphical reasoning in AI.

Abstract

Data visualizations are powerful tools for communicating patterns in quantitative data. Yet understanding any data visualization is no small feat -- succeeding requires jointly making sense of visual, numerical, and linguistic inputs arranged in a conventionalized format one has previously learned to parse. Recently developed vision-language models are, in principle, promising candidates for developing computational models of these cognitive operations. However, it is currently unclear to what degree these models emulate human behavior on tasks that involve reasoning about data visualizations. This gap reflects limitations in prior work that has evaluated data visualization understanding in artificial systems using measures that differ from those typically used to assess these abilities in humans. Here we evaluated eight vision-language models on six data visualization literacy assessments designed for humans and compared model responses to those of human participants. We found that these models performed worse than human participants on average, and this performance gap persisted even when using relatively lenient criteria to assess model performance. Moreover, while relative performance across items was somewhat correlated between models and humans, all models produced patterns of errors that were reliably distinct from those produced by human participants. Taken together, these findings suggest significant opportunities for further development of artificial systems that might serve as useful models of how humans reason about data visualizations. All code and data needed to reproduce these results are available at: https://osf.io/e25mu/?view_only=399daff5a14d4b16b09473cf19043f18.

Paper Structure

This paper contains 32 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Sample response from all evaluated models for a multiple-choice item. Responses after processing are shown in bold and are used for comparison against human and model responses. Responses without bold characters indicate invalid responses.
  • Figure 2: We present CHART-6 (Comparative Human-AI Graphical Reasoning Tests), a human-centered suite of data visualization understanding benchmarks, to assess how close state-of-the-art vision-language models are to achieving both human-level performance and human-like behavior on reasoning tasks involving data visualizations. This test suite spans a wide array of different approaches to designing such assessments, ensuring broad coverage of the skills that are considered to be important when assessing human data visualization literacy.
  • Figure 3: Procedure for processing and validating model responses for comparison to human responses. All vision-language models were presented with every test item 10 times. Each test item consisted of an image containing a data visualization and an accompanying question, preceded by general task instructions. The raw output generated by each model was then processed independently by a different large language model to extract the response in the correct format. These processed outputs were then scored, and the pattern of errors was compared to human error patterns.
  • Figure 4: Proportion of valid responses produced by each model on each assessment.
  • Figure 5: Human and model performance, measured as (A) the mean proportion correct in multiple-choice assessments (GGR, VLAT, and CALVI) and (B) the mean relaxed accuracy in numerical-response assessments (HOLF, HOLF-Multi, and ChartQA). Relaxed accuracy is the proportion of responses that fall within 5% of the correct answer. Empty circles represent estimates of model performance based on all responses, with any invalid responses marked as incorrect. Filled circles represent estimates of model performance based only on valid responses, and therefore reflect an upper bound on model performance. All error bars represent bootstrapped 95% confidence intervals.
  • ...and 2 more figures
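
The scoring described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' code. The function names are hypothetical, the 5% tolerance is assumed to be relative to the magnitude of the correct answer, an invalid response is represented as `None`, and the confidence interval uses a plain percentile bootstrap over per-response scores, which may differ from the resampling scheme used in the paper.

```python
import random
from typing import List, Optional, Sequence, Tuple


def within_tolerance(response: float, target: float, tol: float = 0.05) -> bool:
    """Return True if `response` lies within `tol` (relative) of `target`.

    Assumption: the tolerance is relative to |target|, so a target of 0
    only accepts an exact match.
    """
    return abs(response - target) <= tol * abs(target)


def score_responses(
    responses: Sequence[Optional[float]],
    targets: Sequence[float],
    valid_only: bool = False,
) -> List[float]:
    """Score each response as 1.0 (correct) or 0.0 (incorrect).

    Invalid responses (None) are scored 0.0 when valid_only is False
    (all-responses estimate) and dropped when valid_only is True
    (valid-only estimate, an upper bound on performance).
    """
    scores: List[float] = []
    for response, target in zip(responses, targets):
        if response is None:
            if not valid_only:
                scores.append(0.0)
            continue
        scores.append(1.0 if within_tolerance(response, target) else 0.0)
    return scores


def relaxed_accuracy(
    responses: Sequence[Optional[float]],
    targets: Sequence[float],
    valid_only: bool = False,
) -> float:
    """Proportion of scored responses that fall within tolerance."""
    scores = score_responses(responses, targets, valid_only)
    return sum(scores) / len(scores) if scores else float("nan")


def bootstrap_ci(
    scores: Sequence[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lower_index = max(0, int((alpha / 2) * n_boot))
    upper_index = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return means[lower_index], means[upper_index]


# Toy data, invented purely for illustration (not from the paper).
example_responses = [102.0, 90.0, None, 50.0]
example_targets = [100.0, 100.0, 100.0, 50.0]

all_estimate = relaxed_accuracy(example_responses, example_targets)
valid_estimate = relaxed_accuracy(example_responses, example_targets, valid_only=True)
ci_low, ci_high = bootstrap_ci(score_responses(example_responses, example_targets))
print(all_estimate, valid_estimate, (ci_low, ci_high))
```

Dropping invalid responses can only remove zeros from the average, so the valid-only estimate is never lower than the all-responses estimate. This matches the caption's description of filled circles as an upper bound.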