Captioning Visualizations with Large Language Models (CVLLM): A Tutorial
Giuseppe Carenini, Jordon Johnson, Ali Salamatian
TL;DR
This tutorial addresses how to caption visualizations using state-of-the-art large language models (LLMs) and large vision-language models (LVLMs) by linking InfoVis fundamentals (abstractions, marks, channels) with transformer-based NLP techniques. It outlines LLM limitations (e.g., arithmetic reasoning, planning, hallucinations) and mitigation approaches such as chain-of-thought prompting (CoT), retrieval-augmented generation (RAG), and reinforcement learning from human feedback (RLHF), while highlighting LVLM progress for visualization captioning. A survey of key papers and datasets (e.g., ChartToText, VisText) illustrates advances in dataset creation, modeling, and evaluation, and points to open challenges such as domain specificity, complex visualizations, and multilingual captioning. Overall, the tutorial guides researchers and practitioners in developing robust, accessible, and well-evaluated captioning systems for visualizations using current LLM and LVLM technology.
Abstract
Automatically captioning visualizations is not new, but recent advances in large language models (LLMs) open exciting new possibilities. In this tutorial, after providing a brief review of Information Visualization (InfoVis) principles and past work in captioning, we introduce neural models and the transformer architecture used in generic LLMs. We then discuss their recent applications in InfoVis, with a focus on captioning. Finally, we explore promising future directions in this field.
