True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies

Graziano Blasilli, Marco Angelini

Abstract

This study investigates the ability of multimodal Large Language Models (LLMs) to identify and interpret misleading visualizations, recognizing not only the misleading elements but also their underlying causes and potential intentionality. Our analysis leverages concepts from visualization rhetoric and a newly developed taxonomy of authorial intents as explanatory lenses. We formulated three research questions and addressed them experimentally using a dataset of 2,336 COVID-19-related tweets, half of which contain misleading visualizations, supplemented with real-world examples of perceptual, cognitive, and conceptual errors drawn from VisLies, the IEEE VIS community event dedicated to showcasing deceptive and misleading visualizations. To ensure broad coverage of the current LLM landscape, we evaluated 16 state-of-the-art models, 15 of which are open-weight, spanning a wide range of model sizes, architectural families, and reasoning capabilities. The selection comprises small models, namely Nemotron-Nano-V2-VL (12B parameters), Mistral-Small-3.2 (24B), DeepSeek-VL2 (27B), Gemma3 (27B), and GTA1 (32B); medium-sized models, namely Qianfan-VL (70B), Molmo (72B), GLM-4.5V (108B), LLaVA-NeXT (110B), and Pixtral-Large (124B); and large models, namely Qwen3-VL (235B), InternVL3.5 (241B), Step3 (321B), Llama-4-Maverick (400B), and Kimi-K2.5 (1000B). In addition, we employed OpenAI GPT-5.4, a frontier proprietary model. To establish a human perspective on these tasks, we also conducted a user study with visualization experts to assess how people perceive the rhetorical techniques and authorial intentions behind the same misleading visualizations. This enables a comparison between model and expert behavior, revealing where LLMs align with human judgment and where they diverge.

Paper Structure

This paper contains 25 sections, 16 figures, and 6 tables.

Figures (16)

  • Figure 1: Schematic representation of the prompt template. Black text is shared across all conditions. Condition A does not inform the model about misleadingness. Condition B states that the visualization is misleading. Condition C extends B by providing the ground-truth error list. Task 3 asks the model to identify rhetorical techniques (E.1x) or intents (E.2x).
  • Figure 2: MCC scores for all 16 models in E.1A (Rhetoric) and E.2A (Intent), with rank stability between the two conditions shown on the right. Overall performance is low in both experiments (mean MCC $0.113$ and $0.132$, respectively), with gpt and maverick leading consistently. Color encodes model size (small, medium, large); larger models tend to score higher, a trend that is statistically significant in E.2A. (A sketch of the MCC computation follows this figure list.)
  • Figure 3: Contribution probability matrices $P(\text{rhetoric} \mid \text{error})$ (blue) and $P(\text{intent} \mid \text{error})$ (green) collected from the visualization experts (a) and selected models (b–e) under experimental condition C. Columns correspond to error types; the vertical black line separates visualization design violations (left) from reasoning errors (right). For rhetoric by humans (a), the cross-family transition is visible: Mapping dominates the design-violation columns, while Information Access and Linguistic rise for reasoning errors. For intent by humans (a), the horizontal black line separates intentional (top) from unintentional (bottom) intents; unintentional intents have higher probabilities for design violations, and intentional intents for reasoning errors. (A sketch of how such conditional matrices can be computed follows this list.)
  • Figure 4: Pairwise behavioral similarity matrices under condition C for rhetoric (a) and intent (b). Rows and columns are ordered by hierarchical clustering. In condition C, providing the ground-truth error drives models toward a common behavior, compressing the similarity range relative to conditions A and B (see \ref{fig:similarity_rhetoric_app} and \ref{fig:similarity_intent_app} in the appendix). llava and deepseek remain the most isolated models; gpt and kimi form the most stable high-similarity pair. Human similarity with the model population remains moderate (at most $0.81$ for rhetoric, $0.73$ for intent), confirming that models converge toward a shared yet human-divergent pattern. (A sketch of the clustering-based reordering follows this list.)
  • Figure 5: UMAP projections of rhetoric contribution explanations across experiments E.1A, E.1B, and E.1C. Each point represents one tweet-model pair; color encodes ground truth (blue: non-misleading, red: misleading). The progressive dispersion of the embedding space reflects the increasing prior knowledge provided to the model across conditions. (A sketch of the UMAP projection step follows this list.)
  • ...and 11 more figures
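
As a concrete reference for the MCC values reported in Figure 2, the sketch below computes a Matthews correlation coefficient with scikit-learn. The labels and predictions are hypothetical placeholders, not the paper's data; the paper's tasks may be multiclass rather than binary, which matthews_corrcoef also supports.

```python
# Minimal sketch: computing a Matthews correlation coefficient (MCC),
# the metric reported in Figure 2. Labels and predictions below are
# hypothetical placeholders, not the paper's actual data.
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = misleading, 0 = non-misleading (hypothetical)
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]  # hypothetical predictions from one model

mcc = matthews_corrcoef(y_true, y_pred)
print(f"MCC = {mcc:.3f}")  # ranges from -1 to +1; 0 is chance level
```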
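The contribution matrices in Figure 3 are conditional probabilities of the form $P(\text{rhetoric} \mid \text{error})$. Below is a minimal sketch of one way to derive such a matrix, assuming a simple count-then-normalize scheme over hypothetical co-occurrence counts; the paper's actual aggregation may differ.

```python
# Minimal sketch: normalizing co-occurrence counts into a conditional
# probability matrix P(rhetoric | error), as visualized in Figure 3.
# All count values are hypothetical.
import numpy as np

rhetorics = ["Mapping", "Information Access", "Linguistic"]
errors = ["design violation", "reasoning error"]

# counts[i, j] = times rhetoric i was attributed to error type j
counts = np.array([
    [40.0,  5.0],
    [10.0, 25.0],
    [ 5.0, 20.0],
])

P = counts / counts.sum(axis=0, keepdims=True)  # each column sums to 1
for j, err in enumerate(errors):
    print(err, dict(zip(rhetorics, P[:, j].round(2))))
```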
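Figure 4 orders its similarity matrices by hierarchical clustering. The sketch below shows one common way to do this with SciPy, assuming similarities in $[0, 1]$ converted to distances as $1 - s$ and average linkage; the paper's exact similarity measure and linkage method are not specified here.

```python
# Minimal sketch: reordering a pairwise similarity matrix by
# hierarchical clustering, as in Figure 4. Similarity values and the
# average-linkage choice are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

models = ["gpt", "kimi", "llava", "deepseek"]
S = np.array([                    # hypothetical symmetric similarities
    [1.00, 0.90, 0.40, 0.35],
    [0.90, 1.00, 0.45, 0.30],
    [0.40, 0.45, 1.00, 0.55],
    [0.35, 0.30, 0.55, 1.00],
])

D = 1.0 - S                       # convert similarity to distance
np.fill_diagonal(D, 0.0)
Z = linkage(squareform(D), method="average")
order = leaves_list(Z)            # leaf order for displaying rows/columns
S_ordered = S[np.ix_(order, order)]
print([models[i] for i in order])
```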
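Figure 5 projects explanation embeddings into 2-D with UMAP. Here is a minimal sketch using the umap-learn package on random placeholder vectors; the paper's actual text-embedding model and UMAP parameters are not given here.

```python
# Minimal sketch: 2-D UMAP projection of explanation embeddings, as in
# Figure 5. Random vectors stand in for the real embeddings of the
# models' textual explanations; requires the umap-learn package.
import numpy as np
import umap

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 384))    # placeholder embedding vectors
labels = rng.integers(0, 2, 200)   # 0 = non-misleading, 1 = misleading

coords = umap.UMAP(n_components=2, random_state=0).fit_transform(X)
print(coords.shape)                # (200, 2): one point per tweet-model pair
```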