Do Large Language Models Understand Data Visualization Rules?

Martin Sinnona, Valentin Bonas, Emmanuel Iarussi, Viviana Siless

TL;DR

This paper presents the first systematic evaluation of LLMs against visualization rules using hard-verification ground truth derived from Answer Set Programming (ASP), demonstrating the potential of LLMs as flexible, language-driven validators while highlighting their current limitations compared to symbolic solvers.

Abstract

Data visualization rules, derived from decades of research in design and perception, ensure trustworthy chart communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they can reason about and enforce visualization rules directly. Constraint-based systems such as Draco encode these rules as logical constraints for precise automated checks, but maintaining symbolic encodings requires expert effort, motivating the use of LLMs as flexible rule validators. In this paper, we present the first systematic evaluation of LLMs against visualization rules using hard-verification ground truth derived from Answer Set Programming (ASP). We translated a subset of Draco's constraints into natural-language statements and generated a controlled dataset of 2,000 Vega-Lite specifications annotated with explicit rule violations. LLMs were evaluated on both accuracy in detecting violations and prompt adherence, which measures whether outputs follow the required structured format. Results show that frontier models achieve high adherence (Gemma 3 4B / 27B: 100%, GPT-oss 20B: 98%) and reliably detect common violations (F1 up to 0.82), yet performance drops for subtler perceptual rules (F1 < 0.15 for some categories) and for outputs generated from technical ASP formulations. Translating constraints into natural language improved performance by up to 150% for smaller models. These findings demonstrate the potential of LLMs as flexible, language-driven validators while highlighting their current limitations compared to symbolic solvers.
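To make the two reported metrics concrete, the following is a minimal sketch of how prompt adherence and per-rule F1 could be computed. It is not the authors' evaluation code. It assumes, purely for illustration, that the required structured format is a JSON array of rule names drawn from a known list; the function names `parse_prediction` and `evaluate`, the rule names, and the choice to score a non-adherent output as an empty prediction are all hypothetical.

```python
import json


def parse_prediction(raw, allowed):
    """Return the predicted set of rule names, or None if the output breaks the format.

    Assumed format (hypothetical): a JSON array of strings, each in `allowed`.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    if not all(isinstance(x, str) and x in allowed for x in data):
        return None
    return set(data)


def evaluate(raw_outputs, ground_truth, allowed):
    """Compute prompt adherence and per-rule F1 over a batch of model outputs."""
    parsed = [parse_prediction(raw, allowed) for raw in raw_outputs]

    # Adherence: share of outputs that follow the assumed structured format.
    adherence = sum(p is not None for p in parsed) / len(parsed)

    f1 = {}
    for label in sorted(allowed):
        tp = fp = fn = 0
        for pred, truth in zip(parsed, ground_truth):
            # Assumption: a non-adherent output is scored as predicting no violations.
            pred = pred or set()
            tp += label in pred and label in truth
            fp += label in pred and label not in truth
            fn += label not in pred and label in truth
        denom = 2 * tp + fp + fn
        f1[label] = 2 * tp / denom if denom else 0.0
    return adherence, f1


# Toy usage with made-up rule names and outputs.
allowed = {"rule_a", "rule_b"}
raw_outputs = ['["rule_a"]', "not json", '["rule_a", "rule_b"]', "[]"]
ground_truth = [{"rule_a"}, {"rule_b"}, {"rule_b"}, set()]
adherence, f1 = evaluate(raw_outputs, ground_truth, allowed)
```

In this toy run, one of the four outputs fails to parse, so adherence is 0.75. Keeping adherence separate from F1 lets a formatting failure be distinguished from a reasoning failure.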

Paper Structure (16 sections, 1 equation, 2 figures, 1 table)

Figures (2)

  • Figure 1: Overview of the evaluation framework. (A) We generated random chart specifications and used Draco's solver to identify ground-truth visualization problems, then applied a Kullback-Leibler (KL) divergence filter to ensure a balanced distribution of problem types. We then converted the accepted specifications into Vega-Lite format, yielding a dataset of 2,000 annotated instances (B). (C.1) Subsequently, we evaluated LLMs by prompting them with each specification together with an instruction prompt randomly sampled from five variants, the desired output format, and the list of possible problems. (C.2) Finally, we compared model predictions against the ground truth to compute accuracy and prompt adherence metrics.
  • Figure 2: Effect of KL-divergence filtering during dataset generation. (A) Distribution of visualization problems across the dataset before (gray) and after (orange) applying KL filtering. The filter substantially reduces skew, leading to a more balanced coverage of problem categories. (B) Evolution of KL-divergence over iterations with and without filtering. The KL strategy consistently converges toward lower divergence, indicating that the resulting dataset more closely approximates a uniform distribution of problems.
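The KL-divergence filtering described in the two captions can be sketched as a greedy acceptance loop. This is an illustrative sketch under stated assumptions, not the paper's procedure: the captions do not give the exact acceptance rule or the direction of the divergence, so the code assumes KL(P || U) between the empirical problem distribution P and the uniform distribution U. It accepts a candidate specification only if doing so lowers that divergence or keeps it under a small tolerance. The names `kl_to_uniform`, `kl_filter`, and `tolerance` are hypothetical.

```python
import math


def kl_to_uniform(counts):
    """KL(P || U), where P is the empirical distribution of `counts` and U is uniform."""
    total = sum(counts)
    if total == 0:
        # No data yet: treat as maximally unbalanced so the first candidate is accepted.
        return math.inf
    k = len(counts)
    # Zero-count categories contribute 0 (the limit of p * log p as p -> 0).
    return sum((c / total) * math.log((c / total) * k) for c in counts if c > 0)


def kl_filter(candidates, num_problems, tolerance=0.05):
    """Greedily accept candidates that keep the problem distribution close to uniform.

    Each candidate is a list of problem indices (one specification may violate
    several rules). `tolerance` is an assumed slack, not a value from the paper.
    """
    counts = [0] * num_problems
    accepted = []
    for problems in candidates:
        trial = counts[:]
        for p in problems:
            trial[p] += 1
        new_kl = kl_to_uniform(trial)
        # Accept if the divergence improves or stays within the tolerance.
        if new_kl < kl_to_uniform(counts) or new_kl <= tolerance:
            counts = trial
            accepted.append(problems)
    return accepted, counts


# Toy usage: a stream skewed toward problem 0.
stream = [[0], [0], [0], [1], [0], [2]]
accepted, counts = kl_filter(stream, num_problems=3)
```

On this toy stream, the repeated occurrences of problem 0 are rejected after the first one, and the accepted set ends up with one instance of each problem. Without filtering, the counts would be 4, 1, and 1.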