Evaluating Graphical Perception with Multimodal LLMs
Rami Huu Nguyen, Kenichi Maeda, Mahsa Geshvadi, Daniel Haehn
TL;DR
This work investigates graphical perception in multimodal large language models (MLLMs) by replicating Cleveland and McGill's classic experiments and comparing pretrained models, fine-tuned models, and zero-shot prompting across five perceptual tasks. It evaluates the models with the MLAE metric (and related error measures) on elementary perceptual features, position-length and position-angle judgments, framed versus unframed cues, and a Weber-like point-cloud task. The results indicate that pretrained MLLMs often outperform both humans and fine-tuned variants, and that zero-shot prompting improves performance. Strengths and weaknesses are task-specific: curvature and framing tasks appear more tractable for these models, whereas on other tasks the models align more closely with human perception or underperform compared to pretrained baselines. Overall, the results underscore both the potential and the limits of MLLMs for automating graphical-perception analyses, and point to future work on scaling fine-tuning, covering more visualization types, and incorporating chain-of-thought reasoning.
Abstract
Multimodal Large Language Models (MLLMs) have made remarkable progress in analyzing and understanding images. Despite these advancements, accurately regressing values in charts remains an underexplored area for MLLMs. How do MLLMs perform when applied to graphical perception tasks in visualization? Our paper investigates this question by reproducing Cleveland and McGill's seminal 1984 experiment and comparing the results against human task performance. Our study primarily evaluates fine-tuned models, pretrained models, and zero-shot prompting to determine whether they closely match human graphical perception. Our findings show that MLLMs outperform human task performance in some cases but not in others. We report the results of all experiments to foster an understanding of where MLLMs succeed and fail when applied to data visualization.
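
The MLAE metric named in the TL;DR is not defined in this section. As a rough illustration only, the sketch below assumes a log-scaled absolute error in the style of Cleveland and McGill, log2(mean |predicted - true| + 1/8), with values expressed in percent. The function name `mlae`, the choice to take the log of the mean error, and the input scale are assumptions made for this example, not details confirmed by the text.

```python
import math


def mlae(predicted, actual):
    """Log-scaled absolute error: log2(mean |predicted - actual| + 1/8).

    Assumption: inputs are equal-length sequences of values in percent.
    The 1/8 offset keeps the logarithm finite when the error is zero.
    """
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean absolute error over all judgments.
    mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
    return math.log2(mae + 0.125)


if __name__ == "__main__":
    # Hypothetical model estimates versus ground-truth values (percent).
    print(mlae([48.0, 30.5, 71.0], [50.0, 30.0, 75.0]))
```

Under this formulation, lower scores are better, and a perfect set of estimates scores log2(1/8) = -3.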
