Identifying & Interactively Refining Ambiguous User Goals for Data Visualization Code Generation

Mert İnan, Anthony Sicilia, Alex Xie, Saujas Vaduguru, Daniel Fried, Malihe Alikhani

TL;DR

Ambiguity in natural language to data visualization code is quantified via $E(I, \mathbb{C}(U)) - E(I, \mathbb{C}(U^*))$, motivating a director-coder framework that treats code generation as a cooperative dialogue. The authors develop a multimodal taxonomy of plotting-domain ambiguity and a set of automatic metrics to identify it, showing that these metrics align with human ambiguity annotations better than standard uncertainty baselines. They evaluate pragmatics-inspired dialogue strategies (Cooperative, Discoursive, Inquisitive) with GPT-4o on DS-1000 Matplotlib problems, showing that interactive dialogue improves code accuracy and targets ambiguities more effectively than non-dialogue baselines. The work provides a principled approach to aligning user goals with generated visualization code and points to practical implications for next-generation interactive coding assistants.
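The difference in the TL;DR can be read as a gap between two error values. The following is a minimal sketch of that reading, not the authors' implementation. It assumes that $E$ is an error measure, $I$ is the intended image, $\mathbb{C}(\cdot)$ is the code produced for an utterance, $U$ is the original utterance, and $U^*$ is a clarified version of it; the summary does not define these symbols, so each of these readings is an assumption. The functions `error_rate` and `ambiguity_gap` are hypothetical names, and using the fraction of failed unit tests as the error is likewise an illustrative choice.

```python
from statistics import mean


def error_rate(test_results):
    """Hypothetical stand-in for E: the fraction of unit tests that failed.

    `test_results` is a list of booleans, True meaning the test passed.
    """
    return 1.0 - mean(1.0 if passed else 0.0 for passed in test_results)


def ambiguity_gap(error_original, error_clarified):
    """Assumed reading of E(I, C(U)) - E(I, C(U*)).

    A large positive value means the code written from the original
    utterance is much worse than the code written from the clarified one.
    """
    return error_original - error_clarified


# Illustrative inputs only; these are not results from the paper.
original_run = [True, False, False, True]   # tests on code from U
clarified_run = [True, True, False, True]   # tests on code from U*
gap = ambiguity_gap(error_rate(original_run), error_rate(clarified_run))
print(gap)
```

Under this reading, a gap near zero suggests the original utterance already conveyed the goal, while a large gap suggests that clarification would help.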

Abstract

Establishing shared goals is a fundamental step in human-AI communication. However, ambiguities can lead to outputs that seem correct but fail to reflect the speaker's intent. In this paper, we explore this issue with a focus on the data visualization domain, where ambiguities in natural language impact the generation of code that visualizes data. The availability of multiple views of the context (e.g., the intended plot and the code rendering the plot) allows for a unique and comprehensive analysis of diverse ambiguity types. We develop a taxonomy of the types of ambiguity that arise in this task and propose metrics to quantify them. Using Matplotlib problems from the DS-1000 dataset, we demonstrate that our ambiguity metrics correlate better with human annotations than uncertainty baselines do. Our work also explores how multi-turn dialogue can reduce ambiguity and therefore improve code accuracy by better matching user goals. We evaluate three pragmatic models to inform our dialogue strategies: Gricean Cooperativity, Discourse Representation Theory, and Questions under Discussion. A simulated user study reveals how pragmatic dialogues reduce ambiguity and enhance code accuracy, highlighting the value of multi-turn exchanges in code generation.

Paper Structure

This paper contains 42 sections, 1 equation, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: This figure summarizes the contributions of this paper. We formalize and identify ambiguity in data visualization code, then use pragmatics-inspired dialogue strategies to interactively resolve ambiguities in user intents. To this end, we present a multimodal taxonomy of ambiguity categories, and new metrics informed by this taxonomy.
  • Figure 2: This figure depicts the causal graph of a multimodal coding task. I: image, C: ground-truth code used to generate I, T: unit tests, P: director's prompt, and A: coder's answer. When ambiguity is prevalent, the edges of this graph are cut and the coder's answer relies on a non-descriptive prompt.
  • Figure 3: This figure shows the dialogue flow for a pragmatic coder, where the initial intent of the dialogue is given on the left, and the different responses generated using separate personas are given in the middle.
  • Figure 4: This figure shows a breakdown of the change in the mean pass@1 scores ($\Delta$ = post-dialogue $-$ original) across ambiguous and non-ambiguous instances of the DS-1000 dataset. Dialogue yields larger gains on ambiguous instances than on non-ambiguous ones.
  • Figure 5: This figure shows how the number of unique code completions changes with the temperature of the model. From left to right, the histograms correspond to temperatures of 0.3, 0.7, and 1.0. The horizontal axis is the question number in the DS-1000 Matplotlib dataset. As expected, uniqueness is higher at higher temperatures. However, at very high temperatures, completions may differ only in minor ways that inflate the overall uniqueness. Hence, a moderate temperature such as 0.7 gives more reliable results for further experimentation.
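The quantities behind Figures 4 and 5 can be sketched in a few lines of Python. This is an illustrative sketch only: the record layout (`ambiguous`, `before`, `after`), the function names, and the whitespace normalization used to decide when two completions count as the same are all assumptions, not details taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def delta_pass_at_1(records):
    """Mean change in pass@1 (post-dialogue minus original), split by ambiguity.

    Each record is a dict with hypothetical keys:
      'ambiguous': bool, 'before': 0 or 1, 'after': 0 or 1.
    """
    deltas = defaultdict(list)
    for record in records:
        group = "ambiguous" if record["ambiguous"] else "non_ambiguous"
        deltas[group].append(record["after"] - record["before"])
    return {group: mean(values) for group, values in deltas.items()}


def unique_completions(completions):
    """Count distinct completions after collapsing runs of whitespace.

    Collapsing whitespace is a rough heuristic chosen for this sketch, so
    that trivial spacing differences are not counted as distinct answers.
    """
    return len({" ".join(code.split()) for code in completions})


# Illustrative inputs only; these are not results from the paper.
records = [
    {"ambiguous": True, "before": 0, "after": 1},
    {"ambiguous": True, "before": 1, "after": 1},
    {"ambiguous": False, "before": 1, "after": 1},
]
print(delta_pass_at_1(records))

samples = ["plt.plot(x, y)", "plt.plot(x,  y)", "plt.scatter(x, y)"]
print(unique_completions(samples))
```

The second function also illustrates the caveat in the Figure 5 caption: without some normalization, completions that differ only cosmetically would inflate the uniqueness count at high temperatures.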