Visualization Literacy of Multimodal Large Language Models: A Comparative Study

Zhimin Li, Haichao Miao, Valerio Pascucci, Shusen Liu

TL;DR

This work uses visualization literacy as a framework for evaluating how well multimodal large language models (MLLMs) understand visualizations, drawing on the VLAT and Mini-VLAT datasets. It compares state-of-the-art MLLMs (GPT4-o, Claude 3 Opus, Gemini 1.5 Pro) with one another and with human baselines, using a simple, non-chain-of-thought prompting setup with multiple runs per question. The study finds that MLLMs perform competitively and surpass humans on certain tasks, such as identifying correlations, clusters, and hierarchical structures, while showing distinct failure patterns, most notably color-semantics confusion, value retrieval, and pie-chart interpretation. These results clarify the capabilities and limitations of vision-enabled LLMs on visualization tasks and suggest directions for improving evaluation and prompting techniques in visualization research.

Abstract

The recent introduction of multimodal large language models (MLLMs) combines the inherent power of large language models (LLMs) with the renewed capabilities to reason about the multimodal context. The potential usage scenarios for MLLMs significantly outpace those of their text-only counterparts. Many recent works in visualization have demonstrated MLLMs' capability to understand and interpret visualization results and explain the content of the visualization to users in natural language. In the machine learning community, the general vision capabilities of MLLMs have been evaluated and tested through various visual understanding benchmarks. However, the ability of MLLMs to accomplish specific visualization tasks based on visual perception has not been properly explored and evaluated, particularly from a visualization-centric perspective. In this work, we aim to fill the gap by utilizing the concept of visualization literacy to evaluate MLLMs. We assess MLLMs' performance on two popular visualization literacy evaluation datasets (VLAT and Mini-VLAT). Under the framework of visualization literacy, we develop a general setup to compare different multimodal large language models (e.g., GPT4-o, Claude 3 Opus, Gemini 1.5 Pro) with one another and against existing human baselines. Our study demonstrates MLLMs' competitive performance in visualization literacy, where they outperform humans in certain tasks such as identifying correlations, clusters, and hierarchical structures.
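
As a rough illustration of how an evaluation with several runs per question might be scored and set against a human baseline, the sketch below averages repeated runs per question, then averages questions within each task, and finally reports the gap to a human score. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names (`per_task_accuracy`, `compare_to_baseline`), the task labels, and all numbers are hypothetical placeholders, and the paper's actual scoring rule may differ.

```python
from collections import defaultdict
from statistics import mean


def per_task_accuracy(records):
    """Aggregate run-level results into one accuracy per task.

    `records` is an iterable of (task, question_id, is_correct) tuples,
    one tuple per run of a question (hypothetical record layout).
    """
    # First average the repeated runs of each question ...
    by_question = defaultdict(list)
    for task, question_id, is_correct in records:
        by_question[(task, question_id)].append(1.0 if is_correct else 0.0)

    # ... then average the per-question scores within each task.
    by_task = defaultdict(list)
    for (task, _question_id), run_scores in by_question.items():
        by_task[task].append(mean(run_scores))
    return {task: mean(scores) for task, scores in by_task.items()}


def compare_to_baseline(model_acc, human_acc):
    """Return model accuracy minus human accuracy for each shared task."""
    return {t: model_acc[t] - human_acc[t] for t in model_acc if t in human_acc}


# Made-up placeholder data for illustration only, NOT results from the paper.
runs = [
    ("find_correlation", "q1", True),
    ("find_correlation", "q1", True),
    ("retrieve_value", "q2", True),
    ("retrieve_value", "q2", False),
    ("retrieve_value", "q3", False),
    ("retrieve_value", "q3", False),
]
human = {"find_correlation": 0.5, "retrieve_value": 0.75}  # placeholder baseline

model = per_task_accuracy(runs)
gaps = compare_to_baseline(model, human)
print(model)
print(gaps)
```

Averaging per question before averaging per task keeps a question that happened to be run more often from dominating its task score; a positive gap marks a task where the model exceeds the human baseline in this toy data.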

Paper Structure (17 sections, 7 figures, 3 tables)

This paper contains 17 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Performance of different multimodal large language models on the Mini-VLAT dataset.
  • Figure 2: The language model may be confused by ambiguous colors and resort to prior knowledge rooted in color-semantic associations [mukherjee2024estimating], which leads to incorrect answers.
  • Figure 3: The language model has limited ability to retrieve values.
  • Figure 4: The language model has limited ability to retrieve values from arc-based visual encodings.
  • Figure 5: Average score distribution over 53 questions for humans and GPT4-o.
  • ...and 2 more figures