VOICE: Visual Oracle for Interaction, Conversation, and Explanation

Donggang Jia, Alexandra Irger, Lonni Besancon, Ondrej Strnad, Deng Luo, Johanna Bjorklund, Anders Ynnerman, Ivan Viola

TL;DR

VOICE bridges large language models and real-time 3D molecular visualization to enable natural, voice-driven explanations of complex biological structures for lay audiences. It introduces a pack-of-bots dialogue system and a scene-tree–based interactive text-to-visualization pipeline to generate coherent visual narrations aligned with user queries. Evaluation with science-communication experts shows low latency, accurate content, and valuable feedback on guidance and representation, indicating strong potential for autonomous science communication in public centers. Future work will pursue unified instruction-extraction models, deeper integration of visual state into dialogue, and dynamic animations to further enhance learning outcomes.

Abstract

We present VOICE, a novel approach to science communication that connects the conversational capabilities of large language models (LLMs) with interactive exploratory visualization. VOICE introduces several technical contributions that drive our conversational visualization framework. Our foundation is a pack-of-bots in which each bot performs a specific task, such as assigning tasks, extracting instructions, or generating coherent content. We employ fine-tuning and prompt engineering techniques to tailor each bot's performance to its specific role and to respond accurately to user queries. Our interactive text-to-visualization method generates a flythrough sequence that matches the content explanation. In addition, natural language interaction lets users navigate and manipulate the 3D models in real time. The VOICE framework can receive arbitrary voice commands from the user and respond verbally, with the response tightly coupled to the corresponding visual representation, at low latency and with high accuracy. We demonstrate the effectiveness of our approach by applying it to the molecular visualization domain, analyzing three 3D molecular models with multi-scale and multi-instance attributes. Finally, we evaluate VOICE with educational experts to show the potential of our approach. All supplemental materials are available at https://osf.io/g7fbr.

Paper Structure

This paper contains 23 sections, 6 figures, 1 table, 3 algorithms.

Figures (6)

  • Figure 1: VOICE's initial screen. VOICE can process an arbitrary speech request to answer a question, return a corresponding animation, or conversationally explore the model.
  • Figure 2: Overview of the VOICE framework. The dialogue system begins with a user's speech query. It uses a "pack-of-bots" architecture to process this query. The system either answers questions or follows instructions, which are then given to a visualization system.
  • Figure 3: Few-shot prompt engineering and prompt-based fine-tuning. Few-shot prompt engineering enables direct output acquisition without altering the model. Conversely, prompt-based fine-tuning updates the model through multiple steps.
  • Figure 4: Overview scene (a), focus scene (b), and cutting plane scene (c). The overview scene shows the external component labels and spatial information, while the focus scene illustrates the structural details. The cutting plane scene displays the internal components.
  • Figure 5: The interactive text-to-visualization method is demonstrated in the scene tree. Minimum index values are updated based on the index value of each node. Then, a traversal list is generated based on the minimum index values.
  • ...and 1 more figure
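
The Figure 5 caption describes two steps on the scene tree: propagating minimum index values, then deriving a traversal list from them. The following is only a minimal sketch of one plausible reading of that caption, not the authors' implementation. It assumes that a node's index is the position at which the generated explanation first mentions it, that unmentioned nodes carry an infinite index, and that a mentioned parent is shown before its children. The names `SceneNode`, `update_min_index`, `traversal_list`, and the example tree are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import inf


@dataclass
class SceneNode:
    """Hypothetical scene-tree node; not taken from the paper."""
    name: str
    # Assumed meaning: position of the first mention in the explanation text.
    # inf marks a node the explanation never mentions.
    index: float = inf
    children: list[SceneNode] = field(default_factory=list)
    # Smallest index found anywhere in this node's subtree.
    min_index: float = inf


def update_min_index(node: SceneNode) -> float:
    """Bottom-up pass: store the minimum index of each subtree."""
    child_minima = [update_min_index(child) for child in node.children]
    node.min_index = min([node.index] + child_minima)
    return node.min_index


def traversal_list(root: SceneNode) -> list[str]:
    """Order mentioned nodes by visiting subtrees in min-index order."""
    update_min_index(root)
    order: list[str] = []

    def visit(node: SceneNode) -> None:
        if node.min_index == inf:
            return  # nothing in this subtree is mentioned
        if node.index != inf:
            order.append(node.name)
        for child in sorted(node.children, key=lambda c: c.min_index):
            visit(child)

    visit(root)
    return order


# Illustrative tree with made-up component names and mention positions.
root = SceneNode("virion", children=[
    SceneNode("envelope", index=2, children=[SceneNode("spike", index=0)]),
    SceneNode("genome", index=1),
    SceneNode("matrix"),  # never mentioned, so it is skipped
])
order = traversal_list(root)
print(order)
```

Under these assumptions the envelope subtree is visited first because it contains the earliest mention, even though the envelope itself is mentioned later than the genome. Whether the actual method orders parents before children in this way is not stated in the caption.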