Augmenting a Large Language Model with a Combination of Text and Visual Data for Conversational Visualization of Global Geospatial Data

Omar Mena; Alexandre Kouyoumdjian; Lonni Besançon; Michael Gleicher; Ivan Viola; Anders Ynnerman

TL;DR

This work tackles LLM blindness to visual data by introducing visioverbal augmentation, a method that pre-processes rendered visualizations and their textual descriptions into a compact JSON file used to augment LLM prompts. The approach has two stages: a one-time pre-processing step per dataset, which extracts visual cues from frames and merges them with the text description, and a runtime loop that handles the conversation with a dual-LLM setup (GPT-4o for delivering information, GPT-4 for issuing commands). The proof of concept, demonstrated on Science On a Sphere geospatial visualizations within the TellUs framework, shows that structured augmentation substantially improves question-answering accuracy over baselines and achieves high usability (a System Usability Scale (SUS) score of about 83.75). The work also highlights limitations, such as time-dependent visual content and dynamic visualization capabilities, and outlines concrete future directions for broadening the method's applicability across platforms and datasets while reducing reliance on extensive fine-tuning.

Abstract

We present a method for augmenting a Large Language Model (LLM) with a combination of text and visual data to enable accurate question answering in visualization of scientific data, making conversational visualization possible. LLMs struggle with tasks like visual data interaction, as they lack contextual visual information. We address this problem by merging a text description of a visualization and dataset with snapshots of the visualization. We extract their essential features into a structured text file that is highly compact, yet descriptive enough to augment the LLM with the contextual information it needs, without any fine-tuning. This approach can be applied to any visualization that has already been rendered in its final form, as long as it is associated with some textual description.
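The merging step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: in the paper an LLM turns the concatenated text into the structured file, whereas this sketch only shows how the two inputs might be concatenated before that call and how a structured reply could be checked and compacted afterwards. The function names, the section labels in the concatenated text, and the required keys (`title`, `summary`, `colors`) are all hypothetical.

```python
import json


def build_structuring_input(description: str, frame_notes: list[str]) -> str:
    """Concatenate per-frame visual notes with the dataset's text description.

    The resulting string is what would be sent to the structuring LLM
    (the layout and labels used here are assumptions).
    """
    notes = "\n".join(
        f"Frame {i + 1}: {note.strip()}" for i, note in enumerate(frame_notes)
    )
    return f"VISUAL NOTES\n{notes}\n\nDATASET DESCRIPTION\n{description.strip()}"


def compact_augmentation(raw_json: str,
                         required: tuple[str, ...] = ("title", "summary", "colors")) -> str:
    """Validate a structured reply and re-serialize it without extra whitespace.

    The required keys are hypothetical; the actual schema is not given here.
    """
    data = json.loads(raw_json)
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"augmentation file is missing keys: {missing}")
    # Compact separators keep the file small enough to prepend to every prompt.
    return json.dumps(data, separators=(",", ":"), ensure_ascii=False)
```

A caller would pass the output of `build_structuring_input` to the structuring model, then run the model's reply through `compact_augmentation` before storing it alongside the dataset.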

Paper Structure (24 sections, 9 figures)

Introduction
Related Work
Design Requirements
Method for Visioverbal Augmentation
Pre-processing: Augmentation Data Generation
Frame Sampling
Visual Information Extraction
Merging and Structuring
Description of Example Augmentation File
Runtime: Conversational Interaction
Context Window
Dual-Bot System
Proof of Concept Implementation
Demonstration
Structured Augmentation: Sea Turtles
...and 9 more sections

Figures (9)

Figure 1: In response to a question about something the user observed on the visualization, the system combines visual information, context information about the specific dataset and the colors used, and its general knowledge of climatology and sea surface temperatures to provide an accurate and informative answer.
Figure 2: The TellUs sphere at the Norrköping Visualization Center C. It provides the framework for the development of technologies enabling a talking visioverbal planet for outreach at science centers and classroom settings.
Figure 3: Simplified overview of our method for conversational interaction. Left: the LLM receives extracted information from the vision model and from the dataset's description, and generates structured augmentation data from these inputs. Right: at runtime, these augmentation data augment the LLM, providing it with the required information to respond to user queries.
Figure 4: Detailed overview of our method for conversational visualization. Left: the pre-processing step executed just once per dataset upon its addition to our database. Images are fed to OpenAI's VISION model along with a prompt engineered to get the model to extract visual information from the images, and output it as text. This is concatenated with the text description of the dataset, and fed to GPT-4o with a prompt that instructs the model to extract information from this concatenated text into a structured JSON format. Right: the main interaction loop of our system. The user's query is added to the context window which contains all previous queries (up to 20), and concatenated with the JSON structure. This is then fed to both LLMs with appropriate prompts. The user's query is always fed to the GPT-4 model along with a prompt instructing it to look for a control or navigation command expressed in natural language, and process it if it is there. If it finds such a command, it converts it to a formal one that our system can parse and deterministically execute. If the query was in fact a request for information, then GPT-4o's response is presented to the user, who can then generate another query.
Figure 5: This Science On a Sphere dataset shows the spread of a tsunami across the Pacific Ocean, and its effects on various coastlines. Since the tsunami starts from a single point before expanding over the entire ocean, the visualization changes significantly over time, so the vision model requires more samples than the usual value of two. Without enough samples, essential information present at different time points could be missed, impacting the model’s ability to capture the full range of variations.
...and 4 more figures
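The runtime loop described in the Figure 4 caption can be sketched as follows. This is a minimal sketch under stated assumptions, not the system's actual code: the two models are passed in as plain callables standing in for GPT-4o (information) and GPT-4 (commands), the class and method names are hypothetical, and the limit of 20 is assumed to include the newest query. As a simplification, the information model is only called when no command is found, whereas the caption describes the prompt being fed to both models.

```python
from collections import deque
from typing import Callable, Optional

MAX_QUERIES = 20  # Figure 4: the context window holds up to 20 queries


class ConversationLoop:
    """Hypothetical sketch of a dual-bot interaction loop."""

    def __init__(self,
                 augmentation_json: str,
                 info_bot: Callable[[str], str],
                 command_bot: Callable[[str], Optional[str]]) -> None:
        self.augmentation_json = augmentation_json
        self.info_bot = info_bot        # stands in for the information model
        self.command_bot = command_bot  # stands in for the command model
        # A bounded deque silently drops the oldest query once it is full.
        self.history: deque[str] = deque(maxlen=MAX_QUERIES)

    def handle(self, query: str) -> tuple[str, str]:
        """Return ("command", formal_command) or ("answer", text)."""
        self.history.append(query)
        # Concatenate the JSON augmentation with the queries in the window.
        prompt = self.augmentation_json + "\n" + "\n".join(self.history)
        command = self.command_bot(prompt)
        if command is not None:
            # A formal command would be parsed and executed deterministically.
            return ("command", command)
        return ("answer", self.info_bot(prompt))
```

Injecting the models as callables keeps the routing logic testable without network access; a real deployment would wrap the respective API calls and prompts in those two callables.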
