The Visualization JUDGE: Can Multimodal Foundation Models Guide Visualization Design Through Visual Perception?

Matthew Berger, Shusen Liu

TL;DR

The paper argues that multimodal foundation models (MFMs) can guide visualization design by treating perception-enabled MFMs as judges that observe visualizations, reason about goals, and propose actionable improvements. It formalizes two core pathways: text-to-image density estimation for gradient-based optimization and multimodal LLMs for high-level evaluation and iterative guidance, with explicit problem formulations such as maximizing $\log p_\theta(V(\mathcal{D},\mathbf{v})|c)$ and leveraging $I\sim p(I|c)$ as a density signal. Key contributions include definitions and framings of MFMs as visualization judges (evidence, analysis, action), a differentiable visualization design pipeline, and discussions of adaptation via fine-tuning, prompting, and zeroth-order optimization, along with considerations of alignment, robustness, and design diversity. The work lays out practical research directions for integrating MFMs into visual analytics workflows, potentially enabling more robust, diverse, and machine-aware visualization design. Overall, it provides a blueprint for leveraging perception and language understanding in MFMs to augment, rather than replace, human visualization expertise, with emphasis on alignment, evaluation, and iterative design loops.
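As a rough illustration of the optimization view summarized above, the following sketch searches for design parameters $\mathbf{v}$ that maximize a score of the rendered visualization $V(\mathcal{D},\mathbf{v})$. Everything in it is an assumption made for illustration: `render` is a toy one-dimensional renderer, `make_score` builds a stand-in for $\log p_\theta(\cdot \mid c)$ (a negative squared distance to a reference image, not a text-to-image model), and the optimizer is a simple finite-difference scheme, one possible zeroth-order method that uses only score evaluations. It is not the authors' pipeline.

```python
import math

# Toy setup: all names below are hypothetical and chosen for illustration.
GRID = [i / 31 for i in range(32)]      # pixel centers of a 1D "image"
DATA = [0.15, 0.2, 0.5, 0.55, 0.8]      # toy dataset D
LOW, HIGH = 0.01, 1.0                   # bounds for both design parameters


def render(data, v):
    """Toy V(D, v): splat each data point as a Gaussian mark.

    v = (mark_size, opacity). Returns a list of pixel intensities.
    """
    size, opacity = v
    return [
        opacity * sum(math.exp(-((g - x) ** 2) / (2 * size ** 2)) for x in data)
        for g in GRID
    ]


def make_score(target):
    """Stand-in for log p_theta(V(D, v) | c): higher means closer to target."""
    def score(v):
        image = render(DATA, v)
        return -sum((p - t) ** 2 for p, t in zip(image, target))
    return score


def estimate_gradient(f, v, eps=1e-4):
    """Central finite differences: only evaluates f, never differentiates it."""
    grad = []
    for i in range(len(v)):
        plus, minus = list(v), list(v)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad


def optimize(f, v0, rounds=50, step0=0.1):
    """Gradient ascent with step halving; a step is kept only if it improves f."""
    v, best = list(v0), f(v0)
    for _ in range(rounds):
        grad = estimate_gradient(f, v)
        step = step0
        while step > 1e-8:
            cand = [min(HIGH, max(LOW, vi + step * gi)) for vi, gi in zip(v, grad)]
            cand_score = f(cand)
            if cand_score > best:
                v, best = cand, cand_score
                break
            step *= 0.5
    return v, best


if __name__ == "__main__":
    # The reference image plays the role of "what the prompt c describes".
    score = make_score(render(DATA, (0.08, 0.6)))
    start = (0.2, 0.3)
    v_opt, s_opt = optimize(score, start)
    print("start score:", score(start))
    print("final score:", s_opt, "at parameters", v_opt)
```

With a differentiable renderer and a model that exposes gradients, `estimate_gradient` could be replaced by automatic differentiation; the finite-difference version is kept here because it works with any black-box score.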

Abstract

Foundation models for vision and language are the basis of AI applications across numerous sectors of society. The success of these models stems from their ability to mimic human capabilities, namely visual perception in vision models, and analytical reasoning in large language models. As visual perception and analysis are fundamental to data visualization, in this position paper we ask: how can we harness foundation models to advance progress in visualization design? Specifically, how can multimodal foundation models (MFMs) guide visualization design through visual perception? We approach these questions by investigating the effectiveness of MFMs for perceiving visualization, and formalizing the overall visualization design and optimization space. Specifically, we think that MFMs can best be viewed as judges, equipped with the ability to criticize visualizations, and provide us with actions on how to improve a visualization. We provide a deeper characterization for text-to-image generative models, and multi-modal large language models, organized by what these models provide as output, and how to utilize the output for guiding design decisions. We hope that our perspective can inspire researchers in visualization on how to approach MFMs for visualization design.

Paper Structure (21 sections, 1 equation, 6 figures)


Figures (6)

  • Figure 1: Unlike large language models (middle), multimodal foundation models (MFMs) (bottom) can process both language and vision. Where humans can think about possible visual encodings for a particular dataset and task (top), MFMs can similarly reason about visualizations, represented as images, when designing visualizations.
  • Figure 2: The range of MFMs for visualization design presents a trade-off between (1) constraints on the design space and (2) the methods available for optimization. For instance, T2I models that report the likelihood of a visualization offer high flexibility in optimization (gradient-based methods) but are limited in what can be optimized (low-level parameters such as mark size and opacity). Conversely, because MLLMs are more expressive in their outputs, they permit exploring much more of the visualization design space, but our choices for optimization within this space are more limited.
  • Figure 3: Our prior work [jeong2024textbased] considers how to find visualization parameters (transfer functions for color and opacity) that yield visualizations compatible with a user's description. Here we show how different styles of this volumetric data, representing a tree, can be depicted simply by changing the description of the volume, or of what the user aims to see in the data.
  • Figure 4: Can an MLLM adjust visualization parameters automatically based on (1) its visual understanding of the visualization output and (2) the user's instructions in natural language? The results shown here are from the AVA work [liu2024ava], where we demonstrate that an autonomous visualization agent can be designed by allowing the MLLM to iteratively refine an existing visualization.
  • ...and 2 more figures
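
The iterative refinement described in the Figure 4 caption, combined with the judge framing (evidence, analysis, action), can be sketched as a simple loop. This is a minimal sketch under stated assumptions: `Judgment`, `stub_judge`, and `refine` are hypothetical names, the visualization is reduced to a dictionary of parameters, and `stub_judge` is a hand-written rule standing in for an MLLM that would inspect the rendered image. It does not reproduce the AVA system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Spec = Dict[str, float]  # hypothetical visualization parameters


@dataclass
class Judgment:
    evidence: str                    # what the judge observed
    analysis: str                    # how that relates to the goal
    action: Optional[Spec] = None    # parameter updates, or None when satisfied


def stub_judge(spec: Spec, goal: str) -> Judgment:
    """Stand-in for an MLLM call; a real judge would look at the rendered image."""
    opacity = spec["opacity"]
    evidence = f"marks are drawn at opacity {opacity:.2f}"
    if goal == "reduce overplotting" and opacity > 0.3:
        return Judgment(
            evidence=evidence,
            analysis="dense regions saturate and hide structure",
            action={"opacity": round(opacity * 0.5, 4)},
        )
    return Judgment(evidence=evidence, analysis="goal appears satisfied")


def refine(
    spec: Spec,
    goal: str,
    judge: Callable[[Spec, str], Judgment],
    max_rounds: int = 5,
) -> Tuple[Spec, List[Judgment]]:
    """Ask the judge, apply its action, and repeat until it proposes nothing."""
    spec = dict(spec)  # work on a copy so the caller's spec is untouched
    history: List[Judgment] = []
    for _ in range(max_rounds):
        judgment = judge(spec, goal)
        history.append(judgment)
        if judgment.action is None:
            break
        spec.update(judgment.action)
    return spec, history


if __name__ == "__main__":
    final_spec, history = refine(
        {"opacity": 1.0, "size": 4.0}, "reduce overplotting", stub_judge
    )
    for j in history:
        print(j.evidence, "|", j.analysis, "|", j.action)
    print("final spec:", final_spec)
```

Keeping the judge behind a plain callable means the rule-based stub could be swapped for a model-backed judge without changing the loop; the `max_rounds` cap guards against a judge that never declares the goal satisfied.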