Exploring Multimodal Prompt for Visualization Authoring with Large Language Models

Zhen Wen, Luoxuan Weng, Yinghao Tang, Runjin Zhang, Yuxin Liu, Bo Pan, Minfeng Zhu, Wei Chen

TL;DR

This work identifies the limitations of text-only prompts for visualization authoring by analyzing a corpus of prompts and the ways LLMs interpret them. It introduces VisPilot, a multimodal prompting framework that combines visual prompts (sketches, annotations, manipulations) with text prompts to clarify user intent and guide LLMs toward precise Vega-Lite specifications. Through case studies and a controlled user study, the authors show that multimodal prompting improves accuracy and user satisfaction without sacrificing task efficiency, and they articulate design principles for future multimodal visualization systems. The study demonstrates the potential of combining sketch-based input and direct manipulation to enhance human-AI collaboration in creative visualization tasks, with implications for future visualization tools and education.

Abstract

Recent advances in large language models (LLMs) have shown great potential in automating visualization authoring through simple natural language utterances. However, instructing LLMs using natural language is limited in precision and expressiveness for conveying visualization intent, leading to misinterpretation and time-consuming iterations. To address these limitations, we conduct an empirical study to understand how LLMs interpret ambiguous or incomplete text prompts in the context of visualization authoring, and the conditions under which LLMs misinterpret user intent. Informed by the findings, we introduce visual prompts as a complementary input modality to text prompts, which help clarify user intent and improve LLMs' interpretation abilities. To explore the potential of multimodal prompting in visualization authoring, we design VisPilot, which enables users to easily create visualizations using multimodal prompts, including text, sketches, and direct manipulations on existing visualizations. Through two case studies and a controlled user study, we demonstrate that VisPilot provides a more intuitive way to create visualizations without affecting overall task efficiency compared to text-only prompting approaches. Furthermore, we analyze the impact of text and visual prompts in different visualization tasks. Our findings highlight the importance of multimodal prompting in improving the usability of LLMs for visualization authoring. We discuss design implications for future visualization systems and provide insights into how multimodal prompts can enhance human-AI collaboration in creative visualization tasks. All materials are available at https://OSF.IO/2QRAK.

Paper Structure

This paper contains 28 sections, 7 figures.

Figures (7)

  • Figure 1: The procedure of LLM processing on the corpus data.
  • Figure 2: The analysis results of LLM interpretation of utterances. (A) The number of implicit inferences made for each utterance, where each item of specification components is counted as a separate inference if it is inferred by the LLM. (B) The percentage of explicit references or implicit inferences made for each specification component, where a component is counted as implicit if at least one of its items is implicitly inferred.
  • Figure 3: Our prompting framework instructs the LLM to translate visual prompts into visualization specifications step by step.
  • Figure 4: The interface of VisPilot includes four components: (A) Chat Interface, (B) Free-drawing Canvas, (C) Design Panel, and (D) Authoring Flow.
  • Figure 5: The use case of VisPilot for data exploration. A user explores a dataset containing information about the top tracks on Spotify through an iterative process of sketching (A1–A8) and visualization generation (B1–B8).
  • ...and 2 more figures
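
The counting rules in the Figure 2 caption can be made concrete with a small sketch. The utterances, component names, and the `explicit`/`implicit` labels below are invented for illustration and are not taken from the paper's corpus; the choice to compute each component's percentage over only the utterances in which that component appears is also an assumption, since the caption does not state the denominator.

```python
from collections import defaultdict

EXPLICIT, IMPLICIT = "explicit", "implicit"

# Hypothetical annotations: for each utterance, every specification component
# lists its items and whether each item was stated in the text or inferred.
corpus = [
    {
        "utterance": "show average price by month",
        "components": {
            "mark": [("bar", IMPLICIT)],
            "encoding": [
                ("x: month", EXPLICIT),
                ("y: price", EXPLICIT),
                ("x type: temporal", IMPLICIT),
            ],
            "transform": [("aggregate: mean", EXPLICIT)],
        },
    },
    {
        "utterance": "line chart of price over time",
        "components": {
            "mark": [("line", EXPLICIT)],
            "encoding": [("x: date", IMPLICIT), ("y: price", EXPLICIT)],
        },
    },
]


def count_implicit_inferences(annotation):
    """Panel (A) rule: every implicitly inferred item counts as one inference."""
    return sum(
        source == IMPLICIT
        for items in annotation["components"].values()
        for _, source in items
    )


def component_status(annotation):
    """Panel (B) rule: a component is implicit if at least one item is implicit."""
    return {
        name: IMPLICIT if any(source == IMPLICIT for _, source in items) else EXPLICIT
        for name, items in annotation["components"].items()
    }


def implicit_share_by_component(annotations):
    """Percentage of utterances in which each component counts as implicit.

    Assumption: the denominator is the number of utterances that contain
    the component at all.
    """
    seen = defaultdict(int)
    implicit = defaultdict(int)
    for annotation in annotations:
        for name, status in component_status(annotation).items():
            seen[name] += 1
            implicit[name] += status == IMPLICIT
    return {name: 100.0 * implicit[name] / seen[name] for name in seen}


if __name__ == "__main__":
    for annotation in corpus:
        print(annotation["utterance"], count_implicit_inferences(annotation))
    print(implicit_share_by_component(corpus))
```

Under these rules the two panels answer different questions: panel (A) measures how much the LLM had to guess for a single utterance, while panel (B) measures which parts of a specification users tend to leave unstated.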