Augmenting a Large Language Model with a Combination of Text and Visual Data for Conversational Visualization of Global Geospatial Data
Omar Mena, Alexandre Kouyoumdjian, Lonni Besançon, Michael Gleicher, Ivan Viola, Anders Ynnerman
TL;DR
This work tackles the problem of LLM blindness to visual data by introducing visioverbal augmentation, a method that pre-processes rendered visualizations and their textual descriptions into a compact JSON format to augment LLM prompts. The approach uses a two-stage pipeline: a one-time preprocessing step per dataset that extracts visual cues from frames and merges them with text, and a runtime loop that runs the conversation with a dual-LLM setup (GPT-4o for information delivery and GPT-4 for issuing commands). The proof of concept, demonstrated on Science On a Sphere geospatial visualizations within the TellUs framework, shows that structured augmentation substantially improves question-answering accuracy over baselines and achieves high usability (SUS ~83.75). The work also identifies limitations, such as handling time-dependent visual content and dynamic visualization capabilities, and outlines concrete future directions for broadening the method's applicability across platforms and datasets while reducing reliance on extensive fine-tuning.
Abstract
We present a method for augmenting a Large Language Model (LLM) with a combination of text and visual data to enable accurate question answering about visualizations of scientific data, making conversational visualization possible. LLMs struggle with tasks like visual data interaction because they lack contextual visual information. We address this problem by merging a text description of a visualization and dataset with snapshots of the visualization. We extract their essential features into a structured text file that is highly compact, yet descriptive enough to augment the LLM with contextual information, without any fine-tuning. This approach can be applied to any visualization that has already been rendered, as long as it is associated with some textual description.
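
As a rough illustration of the idea, the following minimal sketch merges a text description with per-frame visual cues into a compact JSON string and prepends it to a user question. It is not the authors' implementation: the function names, the field names (`description`, `frames`, `t`, `cues`), and the prompt wording are all hypothetical, and the extraction of visual cues from the rendered snapshots is assumed to have already happened.

```python
import json


def build_augmentation(description: str, frame_cues: list[dict]) -> str:
    """Merge a text description with per-frame visual cues into compact JSON.

    Hypothetical schema: each entry of `frame_cues` is assumed to look like
    {"timestamp": str, "cues": list[str]} and to come from an earlier,
    unspecified step that analyzes snapshots of the visualization.
    """
    record = {
        # Collapse runs of whitespace so the description stays compact.
        "description": " ".join(description.split()),
        "frames": [
            # Deduplicate and sort cues so the output is deterministic.
            {"t": cue["timestamp"], "cues": sorted(set(cue["cues"]))}
            for cue in frame_cues
        ],
    }
    # Separators without spaces avoid padding in the serialized string.
    return json.dumps(record, separators=(",", ":"), ensure_ascii=False)


def augment_prompt(question: str, augmentation: str) -> str:
    """Prepend the structured context to the user's question (assumed wording)."""
    return (
        "Use the following dataset context to answer.\n"
        f"CONTEXT: {augmentation}\n"
        f"QUESTION: {question}"
    )


if __name__ == "__main__":
    context = build_augmentation(
        "Sea  surface\ntemperature",
        [{"timestamp": "2020-01",
          "cues": ["red near equator", "blue near poles", "red near equator"]}],
    )
    print(augment_prompt("Where is it warmest?", context))
```

Because the context is built once per dataset and reused for every question, a sketch like this keeps the per-question cost to a single string concatenation; the resulting prompt can then be sent to whichever LLM is in use.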
