When Visuals Aren't the Problem: Evaluating Vision-Language Models on Misleading Data Visualizations

Harsh Nishant Lalai, Raj Sanjay Shah, Hanspeter Pfister, Sashank Varma, Grace Guo

Abstract

Visualizations help communicate data insights, but deceptive data representations can distort their interpretation and propagate misinformation. While recent Vision-Language Models (VLMs) perform well on many chart understanding tasks, their ability to detect misleading visualizations, especially when deception arises from subtle reasoning errors in captions, remains poorly understood. Here, we evaluate VLMs on misleading visualization-caption pairs grounded in a fine-grained taxonomy of reasoning errors (e.g., Cherry-picking, Causal inference) and visualization design errors (e.g., Truncated axis, Dual axis, inappropriate encodings). To this end, we develop a benchmark that combines real-world visualizations with human-authored, curated misleading captions designed to elicit specific reasoning and visualization error types, enabling controlled analysis across error categories and modalities of misleadingness. Evaluating many commercial and open-source VLMs, we find that models detect visual design errors substantially more reliably than reasoning-based misinformation, and frequently misclassify non-misleading visualizations as deceptive. Overall, our work fills a gap between coarse detection of misleading content and the attribution of the specific reasoning or visualization errors that give rise to it.

Paper Structure (33 sections, 7 figures, 24 tables)

Figures (7)

  • Figure 1: Example of a misleading chart-caption pair with both visual design and reasoning errors. The chart contains a dual-axis visualization design error, which may be confusing because viewers must mentally map each axis to its corresponding visual representation (bar or line). The caption also introduces a reasoning error by extrapolating a cherry-picked short-term increase to a broader causal claim. Together, these factors can distort interpretation without altering the underlying data.
  • Figure 2: Structure of our dataset organized as a 2$\times$2 grid based on the presence or absence of misleading content in captions and visualizations. Counts denote the number of chart-caption pairs in each cell. Symbols denote error composition: $\varnothing$ no errors, $\triangle$ caption-only errors, $\bigcirc$ visualization-only errors, $\blacksquare$ joint errors.
  • Figure 3: Combined weighted F1 scores for VLMs on benchmark subsets containing only one modality of misinformation.
  • Figure 4: EM scores on the Non-Misleading Caption, Non-Misleading Visualization (case $\varnothing$) subset of the benchmark. Most VLMs incorrectly flag clean examples as containing at least one error.
  • Figure 5: Reasoning Error Composition for the dataset.
  • ...and 2 more figures
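
The 2×2 organization described for Figure 2 and the exact-match (EM) check behind Figure 4 can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names `grid_cell` and `exact_match`, the cell labels, and the representation of errors as lists of label strings are all assumptions made for this example, and the paper's actual EM definition may differ in detail.

```python
from typing import Iterable

# Hypothetical cell labels; the symbols in the comments follow the Figure 2 caption.
CELLS = {
    (False, False): "no_errors",           # ∅
    (True, False): "caption_only",         # △
    (False, True): "visualization_only",   # ○
    (True, True): "joint",                 # ■
}


def grid_cell(caption_errors: Iterable[str], visualization_errors: Iterable[str]) -> str:
    """Map a chart-caption pair to its cell in the 2x2 grid.

    A pair counts as misleading in a modality if it carries at least one
    error label for that modality (an assumption of this sketch).
    """
    has_caption_error = len(set(caption_errors)) > 0
    has_visual_error = len(set(visualization_errors)) > 0
    return CELLS[(has_caption_error, has_visual_error)]


def exact_match(predicted: Iterable[str], gold: Iterable[str]) -> bool:
    """Return True only if the predicted error set equals the gold set.

    For a clean example the gold set is empty, so any flagged error
    makes the prediction a miss.
    """
    return set(predicted) == set(gold)
```

Under this reading, a model that flags a truncated axis on a clean chart-caption pair scores zero on that example, which is the failure mode the Figure 4 caption describes.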