Do LLMs Have Visualization Literacy? An Evaluation on Modified Visualizations to Test Generalization in Data Interpretation
Jiayi Hong, Christian Seto, Arlen Fan, Ross Maciejewski
TL;DR
The paper assesses whether large language models (GPT-4 and Gemini) possess visualization literacy (VL) by administering a modified 53-item Visualization Literacy Assessment Test (VLAT) on PNG visualizations and comparing the results to human baselines. Using an experimental design with visuals present or absent, answer-choice constraints, and decontextualized data, the authors apply bootstrapped logistic regression to analyze model performance across 49 visualization–task interactions. They find that current LLMs generally underperform humans in VL and rely heavily on pre-existing knowledge rather than on the presented visual content, with limited evidence that providing visuals or choices robustly improves performance. The study also analyzes cost, time efficiency, and potential future directions, arguing that LLMs could serve as a cost-effective preliminary evaluator but cannot yet replace human judgment in visualization interpretation. The work provides a template for evaluating VL in LLMs and highlights the need for targeted fine-tuning and prompting strategies before multimodal chart reading becomes reliable.
Abstract
In this paper, we assess the visualization literacy of two prominent Large Language Models (LLMs): OpenAI's Generative Pretrained Transformers (GPT), the backend of ChatGPT, and Google's Gemini, previously known as Bard, to establish benchmarks for assessing their visualization capabilities. While LLMs have shown promise in generating chart descriptions, captions, and design suggestions, their potential for evaluating visualizations remains under-explored. Collecting evaluation data from humans has been a bottleneck for visualization research in terms of both time and money, and if LLMs were able to serve as evaluators, even in some limited role, they could be a significant resource. To investigate the feasibility of using LLMs in the visualization evaluation process, we explore the extent to which LLMs possess visualization literacy, a crucial factor for their effective utility in the field. We conducted a series of experiments using a modified 53-item Visualization Literacy Assessment Test (VLAT) for GPT-4 and Gemini. Our findings indicate that the LLMs we explored currently fall short of the visualization literacy levels reported for the general public in VLAT, and that they relied heavily on their pre-existing knowledge to answer questions instead of using the information provided by the visualization.
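The bootstrapped logistic regression mentioned in the TL;DR can be illustrated with a minimal sketch. This is not the authors' analysis code: the data below are made up, the single predictor `visual_present` is a hypothetical stand-in for one experimental condition, and the helper names `fit_logistic` and `bootstrap_ci` are introduced here only for illustration. The sketch fits a logistic regression of answer correctness on the condition, then resamples the rows with replacement to obtain a percentile confidence interval for the coefficient.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25, ridge=1e-6):
    """Fit logistic regression with Newton's method; returns the coefficient vector."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30.0, 30.0)  # clip to keep exp() well-behaved
        p = 1.0 / (1.0 + np.exp(-eta))        # predicted probability of a correct answer
        w = p * (1.0 - p)                     # Newton weights
        grad = X.T @ (y - p) - ridge * beta   # gradient of the (lightly penalized) log-likelihood
        hess = (X * w[:, None]).T @ X + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def bootstrap_ci(X, y, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI per coefficient, resampling rows with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # one bootstrap resample of row indices
        coefs[b] = fit_logistic(X[idx], y[idx])
    lower, upper = np.percentile(
        coefs, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0
    )
    return lower, upper


# Made-up data (NOT from the paper): 100 answers without the chart (40 correct)
# and 100 answers with the chart (60 correct).
visual_present = np.repeat([0.0, 1.0], 100)
correct = np.concatenate(
    [np.repeat([1.0, 0.0], [40, 60]), np.repeat([1.0, 0.0], [60, 40])]
)
X = np.column_stack([np.ones_like(visual_present), visual_present])  # intercept + condition

beta = fit_logistic(X, correct)
lower, upper = bootstrap_ci(X, correct)
print("log-odds effect of showing the chart:", beta[1])
print("95% bootstrap CI:", lower[1], upper[1])
```

In this toy setup, a bootstrap interval for the `visual_present` coefficient that excludes zero would suggest that showing the chart changes the odds of a correct answer. An established package such as statsmodels or scikit-learn would normally replace the hand-written Newton solver, which is used here only to keep the sketch self-contained.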
