How Aligned are Human Chart Takeaways and LLM Predictions? A Case Study on Bar Charts with Varying Layouts

Huichen Will Wang; Jane Hoffswell; Sao Myat Thazin Thane; Victor S. Bursztyn; Cindy Xiong Bearfield

TL;DR

This case study evaluates the ability of LLMs to emulate human interpretations of data and points to challenges and opportunities in using LLMs to predict human chart takeaways.

Abstract

Large Language Models (LLMs) have been adopted for a variety of visualization tasks, but how far are we from perceptually aware LLMs that can predict human takeaways? Graphical perception literature has shown that human chart takeaways are sensitive to visualization design choices, such as spatial layouts. In this work, we examine the extent to which LLMs exhibit such sensitivity when generating takeaways, using bar charts with varying spatial layouts as a case study. We conducted three experiments and tested four common bar chart layouts: vertically juxtaposed, horizontally juxtaposed, overlaid, and stacked. In Experiment 1, we identified the optimal configurations to generate meaningful chart takeaways by testing four LLMs, two temperature settings, nine chart specifications, and two prompting strategies. We found that even state-of-the-art LLMs struggled to generate semantically diverse and factually accurate takeaways. In Experiment 2, we used the optimal configurations to generate 30 chart takeaways each for eight visualizations across four layouts and two datasets in both zero-shot and one-shot settings. Compared to human takeaways, we found that the takeaways LLMs generated often did not match the types of comparisons made by humans. In Experiment 3, we examined the effect of chart context and data on LLM takeaways. We found that LLMs, unlike humans, exhibited variation in takeaway comparison types for different bar charts using the same bar layout. Overall, our case study evaluates the ability of LLMs to emulate human interpretations of data and points to challenges and opportunities in using LLMs to predict human chart takeaways.
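To make the size of the Experiment 1 search space concrete, the sketch below enumerates the configuration grid implied by the abstract. It assumes the four factors are fully crossed, which the abstract does not state explicitly. All labels (`llm_1`, `temperature_1`, and so on) are hypothetical placeholders, since only the counts are given here.

```python
from itertools import product

# Placeholder labels: the abstract gives only the number of levels per
# factor, not their names, so these identifiers are purely illustrative.
llms = [f"llm_{i}" for i in range(1, 5)]                  # four LLMs
temperatures = [f"temperature_{i}" for i in range(1, 3)]  # two temperature settings
chart_specs = [f"chart_spec_{i}" for i in range(1, 10)]   # nine chart specifications
prompting = [f"prompting_{i}" for i in range(1, 3)]       # two prompting strategies

# Assumption: a full factorial crossing of all four factors.
configurations = list(product(llms, temperatures, chart_specs, prompting))

print(len(configurations))  # 4 * 2 * 9 * 2 = 144
```

Under that assumption, each of the 144 tuples is one candidate configuration to be scored for semantic diversity and factual accuracy.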

Paper Structure (32 sections, 2 equations, 8 figures, 10 tables)

This paper contains 32 sections, 2 equations, 8 figures, 10 tables.

Introduction
Related Work
LLMs for Visualization
Comparison Types in Bar Charts
Experiment 1: Optimal Configurations
Model Types and Temperature Settings
Datasets and Chart Specifications
Prompting
Baseline Strategy
Guided Discovery
Prompts for one-shot settings
Experiment 1: Procedure and Setup
Experiment 1: Evaluation Approaches
Semantic Diversity
Factual Accuracy
...and 17 more sections

Figures (8)

Figure 1: Figure from Xiong et al. [xiong2021visual] showing four spatial arrangements.
Figure 2: Our case study includes three experiments. In Experiment 1, we varied the LLM, decoding temperature, chart specification, and prompting strategy, and identified optimal configurations to elicit LLM chart takeaways for both zero-shot and one-shot settings. In Experiment 2, we generated takeaways using optimal configurations and examined whether LLMs’ comparisons are perceptually sensitive to bar arrangement like humans are. In Experiment 3, we examined whether LLMs’ comparisons are insensitive to data and context like humans are.
Figure 3: A horizontally juxtaposed bar chart depicting the revenue of three stores from two companies. Figure is from Xiong et al. [xiong2021visual].
Figure 4: Distribution of the average cluster count for each configuration (broken down by LLM type) for the zero-shot and one-shot settings. We reviewed the accuracy of the top 25% (see the Semantic Diversity section), corresponding to thresholds of 21 and 19.5 for the zero-shot and one-shot settings, respectively.
Figure 5: Distributions of human, LLM zero-shot, and LLM one-shot comparison types for each layout. LLM zero-shot distributions are generally closer than one-shot distributions to human ones.
...and 3 more figures
