
The Truth, the Whole Truth, and Nothing but the Truth: Automatic Visualization Evaluation from Reconstruction Quality

Roxana Bujack, Li-Ta Lo, Ethan Stam, Ayan Biswas, David Rogers

Abstract

Recent advances in AI enable the automatic generation of visualizations directly from textual prompts using agentic workflows. However, visualizations produced via one-shot generative methods often suffer from insufficient quality, typically requiring a human in the loop to refine the outputs. Human evaluation, though effective, is costly and impractical at scale. To alleviate this problem, we propose an automated metric that evaluates visualization quality without relying on extensive human-labeled datasets. Instead, our approach uses the original underlying data as implicit ground truth. Specifically, we introduce a method that measures visualization quality by assessing the reconstruction accuracy of the original data from the visualization itself. This reconstruction-based metric provides an autonomous and scalable proxy for thorough human evaluation, facilitating more efficient and reliable AI-driven visualization workflows.
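The abstract's core idea is to score a visualization by how well the original data can be recovered from it. A colormapped scalar field illustrates this. The following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a discrete colormap given as a list of RGB triples. It also assumes a hypothetical decoding step that inverts the colormap by nearest-color lookup in RGB; the paper may decode differently, for example in a perceptual color space. Mean absolute error serves as the reconstruction score. The names `apply_colormap`, `reconstruct`, and `reconstruction_error` are illustrative only.

```python
import math
from typing import List, Sequence, Tuple

RGB = Tuple[float, float, float]


def apply_colormap(values: Sequence[float], cmap: Sequence[RGB]) -> List[RGB]:
    """Encode scalars in [0, 1] as the nearest of the len(cmap) discrete colors."""
    n = len(cmap)
    out = []
    for v in values:
        v = min(max(v, 0.0), 1.0)  # clamp to the colormap's domain
        idx = round(v * (n - 1))   # nearest colormap sample
        out.append(cmap[idx])
    return out


def reconstruct(colors: Sequence[RGB], cmap: Sequence[RGB]) -> List[float]:
    """Assumed decoder: map each color to the scalar of its closest legend color.

    Closeness is Euclidean distance in RGB (an assumption). Ties resolve to the
    lowest index because min() returns the first minimum it encounters.
    """
    n = len(cmap)
    values = []
    for c in colors:
        best = min(range(n), key=lambda i: math.dist(c, cmap[i]))
        values.append(best / (n - 1))
    return values


def reconstruction_error(values: Sequence[float], cmap: Sequence[RGB]) -> float:
    """Mean absolute error between the original and the reconstructed scalars."""
    recon = reconstruct(apply_colormap(values, cmap), cmap)
    return sum(abs(a - b) for a, b in zip(values, recon)) / len(values)


if __name__ == "__main__":
    ramp = [0.0, 0.25, 0.5, 0.75, 1.0]
    gray = [(i / 4, i / 4, i / 4) for i in range(5)]
    constant = [(0.5, 0.5, 0.5)] * 5
    print(reconstruction_error(ramp, gray))      # distinct colors: data fully recoverable
    print(reconstruction_error(ramp, constant))  # indistinguishable colors: information lost
```

In this toy setting, a gray ramp with distinct entries reconstructs the sample data exactly. A colormap whose entries are all identical cannot be inverted, so its error is large. This mirrors the intuition that a visualization is better when the data remain recoverable from it.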

Paper Structure (29 sections, 13 figures)

Figures (13)

  • Figure 1: Log-log plot showing the strong correlation between discriminative power and reconstruction error for colormaps that preserve legend-based order.
  • Figure 2: Examples of reconstruction and error patterns for various colormaps. The visualizations highlight the importance of both order preservation and discriminative power for supporting accurate mental models.
  • Figure 3: Examples of reconstruction and error patterns for the 3D shaded reconstruction task, based on the metric that uses the absolute hue difference in CIELCh. This metric favors colorful colormaps.
  • Figure 4: Log-log plot showing the absence of any correlation between the discriminative power and reconstruction error for the reconstruction of shaded images. For the hue-based metric, colorful colormaps produce lower errors because they are more robust to shading in 3D colormapped images.
  • Figure 5: Log-log plots showing the reconstruction errors of the Matplotlib colormaps for the different metrics.
  • ...and 8 more figures
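Figures 3 and 4 refer to a hue-based metric for reconstructing data from shaded 3D images. The sketch below shows one plausible way to decode by hue; it is not the authors' implementation. It assumes pixel and colormap colors are already given in CIELAB as (L*, a*, b*) triples. It also assumes that shading mainly changes lightness, so the CIELCh hue angle h = atan2(b*, a*) survives shading. Decoding picks the colormap entry with the smallest circular hue difference. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Lab = Tuple[float, float, float]  # (L*, a*, b*), assumed precomputed


def hue_deg(c: Lab) -> float:
    """CIELCh hue angle h = atan2(b*, a*), in degrees within [0, 360)."""
    _, a, b = c
    return math.degrees(math.atan2(b, a)) % 360.0


def hue_difference(h1: float, h2: float) -> float:
    """Absolute hue difference measured around the hue circle, in [0, 180]."""
    d = abs(h1 - h2) % 360.0
    return min(d, 360.0 - d)


def reconstruct_by_hue(pixels: Sequence[Lab], cmap: Sequence[Lab]) -> List[float]:
    """Assumed decoder: ignore lightness and match each pixel to the closest hue."""
    n = len(cmap)
    hues = [hue_deg(c) for c in cmap]
    out = []
    for p in pixels:
        hp = hue_deg(p)
        best = min(range(n), key=lambda i: hue_difference(hp, hues[i]))
        out.append(best / (n - 1))
    return out


if __name__ == "__main__":
    cmap = [(50, 60, 0), (50, 0, 60), (50, -60, 0)]  # hues 0, 90, 180 degrees
    shaded = [(20, 30, 0), (20, 0, 30), (20, -30, 0)]  # darker, same hues
    print(reconstruct_by_hue(shaded, cmap))
```

Achromatic colors (a* = b* = 0) have no meaningful hue. A gray colormap therefore cannot be decoded this way. This is consistent with the captions' remark that the hue-based metric favors colorful colormaps.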