Towards Understanding Graphical Perception in Large Multimodal Models
Kai Zhang, Jianwei Yang, Jeevana Priya Inala, Chandan Singh, Jianfeng Gao, Yu Su, Chenglong Wang
TL;DR
This paper addresses the limited understanding of graphical perception in large multimodal models (LMMs) by proposing a theory-grounded evaluation framework that automatically generates diverse chart perception tasks from seed datasets and assesses models across chart types, visual elements, and pixel regions. It shows that current state-of-the-art models, including GPT-4o, struggle to generalize across chart types, misinterpret fundamental visual elements, and fail to cross-reference values within charts reliably, even when explicit numerical cues are present. The authors use a GPT-4o-aided evaluation with a calibrated rubric and provide analyses at the chart, element, and pixel levels using VisText-derived data and Vega-Lite chart generation, highlighting compositional and perceptual gaps. The framework and labeled data are publicly available, offering a fine-grained diagnostic tool to guide improvements in low-level graphical perception and perceptual reasoning for LMMs, with implications for robust chart understanding in real-world, visualization-heavy domains.
Abstract
Despite the promising results of large multimodal models (LMMs) in complex vision-language tasks that require knowledge, reasoning, and perception abilities together, we find, surprisingly, that these models struggle with simple tasks on infographics that require perception only. Because existing benchmarks primarily focus on end tasks that require various abilities, they provide limited fine-grained insight into the limitations of the models' perception abilities. To address this gap, we leverage the theory of graphical perception, an approach used to study how humans decode visual information encoded on charts and graphs, to develop an evaluation framework for analyzing gaps in LMMs' perception abilities on charts. With automated task generation and response evaluation designs, our framework enables comprehensive and controlled testing of LMMs' graphical perception across diverse chart types, visual elements, and task types. We apply our framework to evaluate and diagnose the perception capabilities of state-of-the-art LMMs at three granularity levels (chart, visual element, and pixel). Our findings underscore several critical limitations of current state-of-the-art LMMs, including GPT-4o: their inability to (1) generalize across chart types, (2) understand fundamental visual elements, and (3) cross-reference values within a chart. These insights provide guidance for future improvements in the perception abilities of LMMs. The evaluation framework and labeled data are publicly available at https://github.com/microsoft/lmm-graphical-perception.
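To make the idea of automated, controlled task generation concrete, the following is a minimal sketch, not the released framework's code. It assumes a tiny hypothetical seed table (`SEED_ROWS`), hypothetical helper names (`make_spec`, `make_tasks`), and a single illustrative question template. It renders the same data as several Vega-Lite chart types and pairs each with the same question and ground-truth answer, so that only the chart type varies between tasks.

```python
import json

# Hypothetical seed table for illustration only; it is not taken from VisText.
SEED_ROWS = [
    {"year": "2019", "sales": 12},
    {"year": "2020", "sales": 18},
    {"year": "2021", "sales": 9},
]

VEGA_LITE_SCHEMA = "https://vega.github.io/schema/vega-lite/v5.json"


def make_spec(rows, x_field, y_field, mark):
    """Build a minimal Vega-Lite spec that draws `rows` with the given mark."""
    return {
        "$schema": VEGA_LITE_SCHEMA,
        "data": {"values": rows},
        "mark": mark,
        "encoding": {
            "x": {"field": x_field, "type": "nominal"},
            "y": {"field": y_field, "type": "quantitative"},
        },
    }


def make_tasks(rows, x_field, y_field, marks=("bar", "line", "point")):
    """Pair one perception question with each chart type, sharing one answer."""
    # The ground truth is computed from the data, never from the rendered chart.
    target = max(rows, key=lambda r: r[y_field])
    question = f"Which {x_field} has the highest {y_field}?"
    return [
        {
            "chart_type": mark,
            "spec": make_spec(rows, x_field, y_field, mark),
            "question": question,
            "answer": target[x_field],
        }
        for mark in marks
    ]


tasks = make_tasks(SEED_ROWS, "year", "sales")
# Each spec can be rendered to an image by any Vega-Lite renderer.
print(json.dumps(tasks[0]["spec"], indent=2))
```

Because the data, question, and answer are held fixed across tasks, a difference in a model's accuracy between the bar, line, and point variants can be attributed to the chart type rather than to the underlying values.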
