Evaluating Graphical Perception with Multimodal LLMs

Rami Huu Nguyen, Kenichi Maeda, Mahsa Geshvadi, Daniel Haehn

TL;DR

This work investigates graphical perception with multimodal large language models (MLLMs) by replicating Cleveland and McGill's classic experiments and comparing pretrained models, fine-tuned models, and zero-shot prompting across five perceptual tasks. It evaluates the models with the MLAE metric (and related error measures) on elementary perceptual features, position-length and position-angle judgments, framed versus unframed bars, and a Weber-like point-cloud task. Pretrained MLLMs often outperform both humans and fine-tuned variants, and zero-shot prompting improves performance further. The results are task-specific: curvature and framing tasks are more tractable for these models, whereas on other tasks the models either align more closely with human perception or fall behind the pretrained baselines. Overall, the results underscore the potential and the limits of MLLMs for automating graphical-perception analyses and point to future work on scaling fine-tuning, expanding visualization types, and incorporating chain-of-thought reasoning.

Abstract

Multimodal Large Language Models (MLLMs) have made remarkable progress in analyzing and understanding images. Despite these advancements, accurately regressing values in charts remains an underexplored area for MLLMs. How do MLLMs perform when applied to graphical perception tasks in visualization? Our paper investigates this question by reproducing Cleveland and McGill's seminal 1984 experiment and comparing the results against human task performance. Our study primarily evaluates fine-tuned models, pretrained models, and zero-shot prompting to determine whether they closely match human graphical perception. Our findings show that MLLMs outperform human task performance in some cases but not in others. We report the results of all experiments to foster an understanding of where MLLMs succeed and fail when applied to data visualization.

Paper Structure

This paper contains 19 sections, 2 equations, 5 figures.

Figures (5)

  • Figure 1: Elementary perceptual task results for the most complex task parameterization. In each column: Left: Example stimulus image. Right: MLAE and bootstrapped 95% confidence intervals for different networks. Lower MLAE scores are better.
  • Figure 2: Computational results of the position-angle experiment. Left: Example stimuli. Right: MLAE and bootstrapped 95% confidence intervals (lower is better).
  • Figure 3: Computational results of the position-length experiment. Left: Type 1–5 stimuli for divided and grouped bar charts (as per Cleveland and McGill). Right: MLAE and bootstrapped 95% confidence intervals of our networks.
  • Figure 4: Computational results of the bars-and-framed-rectangles experiment. Left: Stimuli of two bars for length judgment (bottom) following Cleveland and McGill’s setting. Perceiving which bar is longer is significantly easier for humans when a frame is added (top).
  • Figure 5: Computational results of the point cloud experiment. Left: We create 2D point clouds with 10, 100, and 1000 initial dots. Then, we add up to 10 new dots. For humans, it is possible to estimate how many dots are added if there are initially 10 points, but it is impossible to see how many dots are added when starting with 1000 dots.
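
The captions above report MLAE with bootstrapped 95% confidence intervals, but this page does not reproduce the paper's equations. The following is a minimal sketch of how such numbers could be computed, assuming one common formulation that follows Cleveland and McGill's log absolute error, log2(|judged - true| + 1/8), applied here to the mean absolute error. The function names `mlae` and `bootstrap_ci`, the percentile bootstrap, and the resample count are illustrative assumptions, not the authors' code.

```python
import math
import random


def mlae(predicted, actual):
    """Log2 of the mean absolute error plus 1/8 (assumed formulation).

    The 1/8 offset keeps the logarithm finite when the error is zero.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
    return math.log2(mae + 0.125)


def bootstrap_ci(predicted, actual, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for MLAE (illustrative)."""
    rng = random.Random(seed)
    n = len(predicted)
    stats = []
    for _ in range(n_resamples):
        # Resample (prediction, ground truth) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(mlae([predicted[i] for i in idx],
                          [actual[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical model estimates and ground-truth values, for illustration only.
    preds = [10.0, 22.0, 35.0, 48.0, 61.0]
    truth = [12.0, 20.0, 33.0, 50.0, 60.0]
    print("MLAE:", mlae(preds, truth))
    print("95% CI:", bootstrap_ci(preds, truth))
```

With perfect predictions the score bottoms out at log2(1/8) = -3, which is why lower values indicate better performance in the figures.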