Do Large Language Models Understand Data Visualization Principles?

Martin Sinnona, Valentin Bonas, Viviana Siless, Emmanuel Iarussi

TL;DR

This work presents the first systematic evaluation of both LLMs and VLMs on their ability to reason about visualization principles, using hard verification ground truth derived from Answer Set Programming (ASP).

Abstract

Data visualization principles, derived from decades of research in design and perception, ensure proper visual communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they and their vision-language counterparts (VLMs) can reason about and enforce visualization principles directly. Constraint-based systems encode these principles as logical rules for precise automated checks, but translating them into formal specifications demands expert knowledge. This motivates leveraging LLMs and VLMs as principle checkers that can reason about visual design directly, bypassing the need for symbolic rule specification. In this paper, we present the first systematic evaluation of both LLMs and VLMs on their ability to reason about visualization principles, using hard verification ground truth derived from Answer Set Programming (ASP). We compiled a set of visualization principles expressed as natural-language statements and generated a controlled dataset of approximately 2,000 Vega-Lite specifications annotated with explicit principle violations, complemented by over 300 real-world Vega-Lite charts. We evaluated both checking and fixing tasks, assessing how well models detect principle violations and correct flawed chart specifications. Our results highlight both the promise of large (vision-)language models as flexible validators and editors of visualization designs and the persistent gap with symbolic solvers on more nuanced aspects of visual perception. They also reveal an interesting asymmetry: frontier models tend to be more effective at correcting violations than at detecting them reliably.

Paper Structure

This paper contains 35 sections, 1 equation, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Bridging symbolic and language-based reasoning for visualization principle checking. Given a chart specification and its corresponding rendering (A), we automatically check compliance with visualization principles using two complementary reasoning approaches: formal verification via handcrafted ASP constraints and language/image-based reasoning from natural-language prompts and chart renderings (B). The detected infractions are then compared (C), with performance metrics evaluated across both checking and fixing tasks.
  • Figure 2: Overview of the benchmark datasets and generation pipeline. (A) The synthetic dataset is produced by sampling Draco chart specifications from tabular data, automatically generating Vega-Lite renderings and ground-truth annotations of design principle violations. A Kullback–Leibler (KL) divergence filter ensures balanced coverage across violation types by accepting only candidates that improve uniformity in the distribution of problems. (B) The real visualization dataset complements the synthetic corpus by translating human-authored Vega-Lite specifications from GitHub into Draco grammar, enabling principle-level analysis of authentic visualization practices.
  • Figure 3: Evaluation setup for assessing LLM understanding of visualization principles. Each chart instance is defined by its Vega-Lite specification and annotated with one or more principle violations (e.g., high_cardinality_shape, stack_discrete, number_categorical). Structured prompts (five semantically equivalent variants per principle) are used to query each model. Model outputs, expressed in a fixed JSON schema, are validated and compared to the reference annotations to compute F1-scores across all principles.
  • Figure 4: Model performance across mark types and problem categories. (A) Mean F1-scores by mark type, showing how different visual encodings influence model accuracy. (B) Mean F1-scores by principle, sorted by Gemini-2.5-Flash performance.
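
The KL divergence filter mentioned in the Figure 2 caption can be illustrated with a small sketch. The caption only states that a candidate is accepted when it improves the uniformity of the violation distribution; the specifics below are assumptions made for illustration, not the authors' implementation: divergence is measured from the empirical distribution to the uniform distribution, an empty dataset is treated as infinitely far from uniform so that the first candidate is accepted, and the function names `kl_to_uniform`, `accept`, and `filter_candidates` are hypothetical.

```python
import math
from collections import Counter


def kl_to_uniform(counts, labels):
    """KL(p || u): divergence of the empirical violation distribution p
    from the uniform distribution u over `labels`."""
    total = sum(counts.get(label, 0) for label in labels)
    if total == 0:
        # Assumption: an empty dataset counts as maximally unbalanced,
        # so the first non-empty candidate always improves on it.
        return math.inf
    uniform = 1.0 / len(labels)
    kl = 0.0
    for label in labels:
        p = counts.get(label, 0) / total
        if p > 0:  # terms with p == 0 contribute nothing
            kl += p * math.log(p / uniform)
    return kl


def accept(counts, candidate_violations, labels):
    """Accept a candidate only if adding it strictly lowers the divergence."""
    trial = Counter(counts)
    trial.update(candidate_violations)
    return kl_to_uniform(trial, labels) < kl_to_uniform(counts, labels)


def filter_candidates(candidates, labels):
    """Greedy pass: keep each candidate that improves uniformity so far."""
    counts = Counter()
    kept = []
    for violations in candidates:
        if accept(counts, violations, labels):
            counts.update(violations)
            kept.append(violations)
    return kept


# Violation names taken from the Figure 3 caption.
LABELS = ["high_cardinality_shape", "stack_discrete", "number_categorical"]
CANDIDATES = [
    ["stack_discrete"],
    ["stack_discrete"],          # rejected: increases imbalance
    ["number_categorical"],
    ["high_cardinality_shape"],
    ["stack_discrete"],          # rejected: distribution is already uniform
]
KEPT = filter_candidates(CANDIDATES, LABELS)
```

A strict inequality is used so that candidates carrying no violations, or ones that leave the balance unchanged, are not accepted; a real pipeline might relax this to reach a target dataset size.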
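
The scoring step described in the Figure 3 caption (validate the model's JSON output, then compare it to the reference annotations to obtain F1-scores) might look roughly like the following. The actual JSON schema is not given in this overview, so the `{"violations": [...]}` shape is a hypothetical stand-in, as are the function names; treating an invalid output as "no violations predicted" is likewise an assumption.

```python
import json


def parse_output(raw, known_principles):
    """Return the set of predicted violations, or None if the output is invalid.

    Assumes a hypothetical schema: {"violations": ["<principle>", ...]}.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    violations = data.get("violations")
    if not isinstance(violations, list):
        return None
    if not all(isinstance(v, str) and v in known_principles for v in violations):
        return None
    return set(violations)


def f1_per_principle(predictions, references, known_principles):
    """Compute one F1-score per principle over all charts."""
    scores = {}
    for principle in known_principles:
        tp = fp = fn = 0
        for pred, ref in zip(predictions, references):
            pred = pred or set()  # assumption: invalid output predicts nothing
            in_pred = principle in pred
            in_ref = principle in ref
            if in_pred and in_ref:
                tp += 1
            elif in_pred:
                fp += 1
            elif in_ref:
                fn += 1
        denom = 2 * tp + fp + fn
        scores[principle] = 2 * tp / denom if denom else 0.0
    return scores


# Principle names taken from the Figure 3 caption; outputs are made up.
PRINCIPLES = ["high_cardinality_shape", "stack_discrete", "number_categorical"]
RAW_OUTPUTS = [
    '{"violations": ["stack_discrete"]}',
    "not json",
    '{"violations": ["stack_discrete", "number_categorical"]}',
]
REFERENCES = [{"stack_discrete"}, {"number_categorical"}, {"stack_discrete"}]

PREDICTIONS = [parse_output(raw, PRINCIPLES) for raw in RAW_OUTPUTS]
SCORES = f1_per_principle(PREDICTIONS, REFERENCES, PRINCIPLES)
MEAN_F1 = sum(SCORES.values()) / len(SCORES)
```

Averaging the per-principle scores, as in `MEAN_F1`, gives a single macro-style number; whether the paper aggregates this way is not stated in the captions.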