DracoGPT: Extracting Visualization Design Preferences from Large Language Models

Huichen Will Wang, Mitchell Gordon, Leilani Battle, Jeffrey Heer

TL;DR

DracoGPT accurately models the visualization design preferences expressed by LLMs, enabling analysis of those preferences in terms of Draco design constraints. Future work can build on the approach to provide a robust and cost-effective stand-in for LLMs.

Abstract

Trained on vast corpora, Large Language Models (LLMs) have the potential to encode visualization design knowledge and best practices. However, if they fail to do so, they might provide unreliable visualization recommendations. What visualization design preferences, then, have LLMs learned? We contribute DracoGPT, a method for extracting, modeling, and assessing visualization design preferences from LLMs. To assess varied tasks, we develop two pipelines--DracoGPT-Rank and DracoGPT-Recommend--to model LLMs prompted to either rank or recommend visual encoding specifications. We use Draco as a shared knowledge base in which to represent LLM design preferences and compare them to best practices from empirical research. We demonstrate that DracoGPT can accurately model the preferences expressed by LLMs, enabling analysis in terms of Draco design constraints. Across a suite of backing LLMs, we find that DracoGPT-Rank and DracoGPT-Recommend moderately agree with each other, but both substantially diverge from guidelines drawn from human subjects experiments. Future work can build on our approach to expand Draco's knowledge base to model a richer set of preferences and to provide a robust and cost-effective stand-in for LLMs.

Paper Structure (28 sections, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Overview of the DracoGPT-Rank pipeline. (1) User provides prompt templates for an LLM to rank chart pairs; (2) Draco featurizes charts and produces feature vectors consisting of constraint counts; (3) Draco learns constraint weights over LLM-labeled chart pairs by fitting a RankSVM model; (4) The fitted Draco model can be applied to score charts. Results at each stage of the pipeline afford insight into LLM ranking preferences.
  • Figure 2: Overview of the DracoGPT-Recommend chart pair construction pipeline. (1) Given an input chart pair, the pipeline extracts their shared partial specification, then (2) prompts an LLM to "optimally" complete the partial specification. (3) The pipeline constructs up to two new chart pairs for training a Draco model: the LLM completion is labeled as the positive example and an input chart as the negative example.
  • Figure 3: Distribution of positive and negative examples by encoding specification and interpretation task type for Kim et al. and GPT4-Turbo. Only chart pairs for which GPT4-Turbo provides consistent responses are included. The two training sets have similar distributions for value tasks but diverge notably for summary tasks.
  • Figure 4: (A) The number of times each constraint is satisfied in chart pairs where GPT4-Turbo labels disagree with Kim et al. By analyzing where constraint counts diverge, we see, for example, that GPT4-Turbo is more likely to label a chart negative if it uses a size channel (linear_size, interesting_size) or an ordinal_y scale. (B) Constraint weights in the fitted models for the Kim et al. results (blue) and DracoGPT-Rank (gold). The weights listed last exhibit opposite signs, indicating model differences. By analyzing where constraint weights diverge, we see, for example, that the models strongly disagree on the use of a continuous size encoding for summary tasks (summary_continuous_size).
  • Figure 5: A chart pair demonstrating constraint weight trade-offs. Despite higher weights for a linear x scale (weight = 0.469) and ordinal y scale (weight = 0.721), GPT4-Turbo prefers the chart on the left. This chart obtains a lower cost because DracoGPT-Rank has a stronger preference for encoding the variable of interest q1 with the x channel (weight = -1.964) and a continuous x scale (weight = -1.055) for value tasks. Therefore, it is important to evaluate design preferences at the chart level to complement weight-level analysis.
  • ...and 4 more figures
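
The Figure 1 and Figure 5 captions describe charts as vectors of constraint counts, weights learned from labeled chart pairs, and a chart cost in which a lower value is preferred. The Python sketch below illustrates that idea under loud assumptions: it is not Draco's API or the authors' code, the three feature names and the toy chart pairs are hypothetical, and a plain pairwise hinge loss trained by subgradient descent stands in for the RankSVM fit.

```python
from typing import List, Tuple

Vector = List[float]

# Hypothetical constraint names, for illustration only (not Draco's real constraint set).
FEATURES = ["uses_size_channel", "ordinal_y", "x_encodes_q1"]


def dot(w: Vector, x: Vector) -> float:
    """Weighted sum of constraint counts."""
    return sum(wi * xi for wi, xi in zip(w, x))


def cost(w: Vector, counts: Vector) -> float:
    """Chart cost: weighted sum of constraint counts. Lower cost = preferred."""
    return dot(w, counts)


def fit_pairwise_hinge(
    pairs: List[Tuple[Vector, Vector]],
    n_features: int,
    epochs: int = 200,
    lr: float = 0.1,
    reg: float = 0.01,
) -> Vector:
    """Learn weights so each preferred chart costs less than its rejected partner.

    pairs: (preferred_counts, rejected_counts) tuples.
    Simplified stand-in for RankSVM: hinge loss on w . (rejected - preferred)
    with L2 shrinkage, optimized by subgradient descent.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            diff = [r - p for p, r in zip(preferred, rejected)]
            margin = dot(w, diff)
            # L2 regularization step (multiplicative shrink).
            w = [wi * (1.0 - lr * reg) for wi in w]
            # Hinge step: push the cost gap toward a margin of at least 1.
            if margin < 1.0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


# Toy labeled pairs (assumed data): each entry is (preferred chart, rejected chart).
pairs: List[Tuple[Vector, Vector]] = [
    ([0.0, 0.0, 1.0], [1.0, 0.0, 0.0]),
    ([0.0, 1.0, 1.0], [1.0, 1.0, 0.0]),
    ([0.0, 0.0, 1.0], [1.0, 1.0, 1.0]),
]

weights = fit_pairwise_hinge(pairs, n_features=len(FEATURES))

for name, value in zip(FEATURES, weights):
    print(f"{name}: {value:+.3f}")
for preferred, rejected in pairs:
    print(f"preferred cost {cost(weights, preferred):+.3f} "
          f"vs rejected cost {cost(weights, rejected):+.3f}")
```

In this toy setup a positive weight penalizes a constraint and a negative weight rewards it, so a chart can still win a comparison despite satisfying a penalized constraint if it also satisfies a more strongly rewarded one. That is the kind of trade-off the Figure 5 caption describes, and it is why chart-level costs complement weight-level analysis.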