Evaluating Graphical Perception Capabilities of Vision Transformers

Poonam Poonam; Pere-Pau Vázquez; Timo Ropinski

TL;DR

This work investigates the performance of Vision Transformers (ViTs) in elementary visual judgment tasks inspired by the foundational studies of Cleveland and McGill, which quantified the accuracy of human perception across different visual encodings.

Abstract

Vision Transformers (ViTs) have emerged as a powerful alternative to convolutional neural networks (CNNs) in a variety of image-based tasks. While CNNs have previously been evaluated for their ability to perform graphical perception tasks, which are essential for interpreting visualizations, the perceptual capabilities of ViTs remain largely unexplored. In this work, we investigate the performance of ViTs in elementary visual judgment tasks inspired by the foundational studies of Cleveland and McGill, which quantified the accuracy of human perception across different visual encodings. Following their study design, we benchmark ViTs against CNNs and human participants in a series of controlled graphical perception tasks. Our results reveal that, although ViTs demonstrate strong performance in general vision tasks, their alignment with human-like graphical perception in the visualization domain is limited. This study highlights key perceptual gaps and points to important considerations for the application of ViTs in visualization systems and graphical perceptual modeling.

Paper Structure (18 sections, 1 equation, 10 figures, 10 tables)

Introduction
Related Work
Methods
Low-level visual tasks
Data
Vision Transformer Architectures
Vanilla Vision Transformer
Convolutional Vision Transformer
Swin Transformer
Training Procedure
Experiments and Results
Performance Analysis of Humans vs. ViTs
Performance Analysis of CNNs vs. ViTs
Discussion
Humans vs. ViTs
...and 3 more sections

Figures (10)

Figure 1: Illustration of the tested ViT architectures: vViT (a), CvT (b), and Swin (c). Input image patches (green) can be projected linearly (vViT) or through convolution (CvT). Transformer blocks with self-attention (blue) process the projected patches, before feed-forward networks and MLPs (orange) generate the output. CvT additionally uses convolutional embedding (pink). Swin transformer blocks (gray) employ shifted windowing. All models conclude with class label outputs (the labels are the parameters used to generate the image).
Figure 2: Comparison of elementary perceptual task rankings across humans, CNNs, and ViTs. Rankings are derived from MLAE, with lower positions indicating better perceptual accuracy. The shading intensity of each box reflects rank: darker tones indicate higher accuracy (that is, rank 1), while lighter tones correspond to lower ranks. Human rankings are based on Cleveland and McGill, CNN rankings on Haehn et al., and ViT rankings on our retrained models. Lines connect identical perceptual tasks across the three groups. Notably, ViTs exhibit the most deviation, particularly in their treatment of position and length, highlighting greater ambiguity and weaker perceptual alignment with human judgments.
Figure 3: Cross-generalization performance of the Swin Transformer across different parameterizations of elementary perceptual tasks. Each matrix corresponds to a specific task (for example, Position, Length, Area, Shading). Rows represent the parameterization used for training, while columns represent the parameterization used for testing. The top row in each matrix shows within-task (in-parameter) performance, with diagonal entries indicating baseline accuracy when training and testing on the same parameter configuration. Off-diagonal entries indicate the model’s ability to generalize to new parameter settings (that is, changes in object position, size, or shape). Higher values off-diagonal reflect poor generalization and sensitivity to unseen visual variations.
Figure 4: Comparison of regression errors in the elementary perceptual tasks experiment. Visual stimuli (left) and MLAE scores with 95% confidence intervals (lower is better) for various ViTs and humans (right). Swin generally aligns more closely with human performance for simpler encodings, while all ViTs diverge on complex features like curvature and shading. Human scores are derived from the studies of Cleveland and McGill and of Haehn et al.
Figure 5: Performance in position-angle estimation. Visual stimuli (left) and MLAE scores with 95% confidence intervals (lower is better) for various ViTs and humans (right). Human scores are derived from the studies of Cleveland and McGill and of Haehn et al. All ViTs underperform compared to human baselines.
...and 5 more figures
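
The captions above report MLAE scores without defining the metric. The following is a minimal sketch, assuming the definition used in Haehn et al.'s CNN study: the base-2 logarithm of the mean absolute error plus 1/8, which follows Cleveland and McGill's log absolute error. The function name `mlae` and the choice of percentage points as the unit are assumptions for illustration and are not taken from this paper.

```python
import math


def mlae(predicted, actual):
    """Return log2(mean absolute error + 1/8); lower is better.

    Assumed definition (after Haehn et al.); `predicted` and `actual`
    are equal-length sequences of values in percentage points.
    """
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean absolute error over all judgments.
    mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
    # The 1/8 offset keeps the logarithm finite when the error is zero.
    return math.log2(mae + 0.125)


if __name__ == "__main__":
    # A perfect estimate gives the minimum possible score, log2(1/8) = -3.
    print(mlae([50.0], [50.0]))
    # Errors of 2 and 4 points average to 3, so the score is log2(3.125).
    print(mlae([10.0, 20.0], [12.0, 16.0]))
```

Because of the 1/8 offset, a perfect estimator scores -3 rather than negative infinity, which is why lower (and possibly negative) values indicate better accuracy in the figures.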
