Chart What I Say: Exploring Cross-Modality Prompt Alignment in AI-Assisted Chart Authoring
Nazar Ponochevnyi, Anastasia Kuzminykh
TL;DR
The paper investigates cross-modality prompt alignment in AI-assisted chart authoring by comparing spoken and typed instructions. It collects a voice dataset (n=25, 100 prompts) and compares it with two existing text datasets (NLV Corpus and nvBench). Qualitative coding of these prompts identifies 6 input strategies and 22 chart-element types across 5 categories. Voice prompts are longer, more diverse, and more linguistically complex, while text prompts concentrate on basic chart elements. This suggests that each modality needs its own processing and design guidelines. The authors contribute a design framework for voice-enabled authoring, actionable guidelines for adapting text-based systems to support speech, and a publicly available voice-instruction dataset to support the development and evaluation of multimodal visualization tools.
Abstract
Recent chart-authoring systems, such as Amazon Q in QuickSight and Copilot for Power BI, show an emerging focus on supporting natural language input, helping users share meaningful insights from data through chart creation. Currently, chart-authoring systems tend to add voice input by relying on speech-to-text transcription, so spoken and typed input are processed the same way. However, cross-modality input comparisons in other interaction domains suggest that spoken and typed interactions can differ notably in structure, reflecting differences in user expectations shaped by interface affordances. In this work, we therefore compare spoken and typed instructions for chart creation. Our findings suggest that while both text and voice instructions cover chart elements and element organization, voice descriptions contain a variety of command formats, element characteristics, and complex linguistic features. Based on these findings, we developed guidelines for designing voice-based authoring systems, along with additional features that can be incorporated into existing text-based systems to support the speech modality.
