Towards Understanding Graphical Perception in Large Multimodal Models

Kai Zhang, Jianwei Yang, Jeevana Priya Inala, Chandan Singh, Jianfeng Gao, Yu Su, Chenglong Wang

TL;DR

The paper addresses the limited understanding of graphical perception in large multimodal models (LMMs) by proposing a theory-grounded evaluation framework that automatically generates diverse chart perception tasks from seed datasets and assesses models across chart types, visual elements, and pixel regions. It shows that current state-of-the-art models, including GPT-4o, struggle to generalize across chart types, misinterpret fundamental visual elements, and fail to cross-reference values within charts reliably, even when explicit numerical cues are present. The authors use a GPT-4o-aided evaluation with a calibrated rubric and provide analyses at the chart, element, and pixel levels using VisText-derived data and Vega-Lite chart generation, highlighting compositional and perceptual gaps. The framework and labeled data are publicly available, offering a fine-grained diagnostic tool to guide improvements in low-level graphical perception and perceptual reasoning for LMMs, with implications for robust chart understanding in visualization-heavy domains.

Abstract

Despite the promising results of large multimodal models (LMMs) in complex vision-language tasks that require knowledge, reasoning, and perception abilities together, we surprisingly found that these models struggle with simple tasks on infographics that require perception only. As existing benchmarks primarily focus on end tasks that require various abilities, they provide limited fine-grained insight into the limitations of the models' perception abilities. To address this gap, we leverage the theory of graphical perception, an approach used to study how humans decode visual information encoded on charts and graphs, to develop an evaluation framework for analyzing gaps in LMMs' perception abilities in charts. With automated task generation and response evaluation designs, our framework enables comprehensive and controlled testing of LMMs' graphical perception across diverse chart types, visual elements, and task types. We apply our framework to evaluate and diagnose the perception capabilities of state-of-the-art LMMs at three granularity levels (chart, visual element, and pixel). Our findings underscore several critical limitations of current state-of-the-art LMMs, including GPT-4o: their inability to (1) generalize across chart types, (2) understand fundamental visual elements, and (3) cross-reference values within a chart. These insights provide guidance for future improvements in perception abilities of LMMs. The evaluation framework and labeled data are publicly available at https://github.com/microsoft/lmm-graphical-perception.

Paper Structure

This paper contains 27 sections, 9 figures, 10 tables.

Figures (9)

  • Figure 2: Framework of data synthesis and evaluation. With 1,000 randomly sampled datasets as seeds, we edit the Vega-Lite program to generate 14 types of charts and use GPT-4o with textual data tables to generate 10 types of tasks and corresponding answers, resulting in a total of 140,000 inputs for each model to be evaluated. For evaluation, we consider the most representative models from four model categories, and their responses are automatically evaluated by GPT-4o in text format.
  • Figure 3: Accuracy of models on different types of charts with numerical annotations given the same 10 types of tasks. The dotted line refers to the average performance by chart type and color refers to the given chart type, and T-$i$ indicates the $i$-th task detailed in Section \ref{sec:tasks}.
  • Figure 4: Examples of labeled regions and importance heatmaps for two models on Bar and Bar (Anno) charts. Given "What is the value of total assets in billion yuan for the year 2010?", both models successfully locate most labeled important regions on both the Bar (Anno) and Bar charts but fail to reference the correct y-axis values on the Bar chart. The correct answer is "10337.4."
  • Figure 5: Examples of labeled regions and importance heatmap of InternVL2 on a Bar (Anno) chart. Given the task "Determine the share of leisure travelers for historical locations.", InternVL2 incorrectly locates the bar for "Experience fine dining," which is closely positioned near the correct one. As a result, it generates an imperfect answer, 0.4. However, as this value is within 5% of the target value, 0.411, it is judged as correct according to the evaluation rubric considering human perception.
  • Figure 6: Overall accuracy of GPT-4o given 100 datasets with different sampled data points and different chart types.
  • ...and 4 more figures
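
The input count quoted in the Figure 2 caption follows from crossing every seed dataset with every chart type and every task type. The sketch below only reproduces that arithmetic; the constant and function names are illustrative assumptions and are not taken from the released framework.

```python
from itertools import product

# Counts quoted in the Figure 2 caption.
NUM_SEED_DATASETS = 1_000
NUM_CHART_TYPES = 14
NUM_TASK_TYPES = 10


def enumerate_inputs(n_datasets, n_charts, n_tasks):
    """Yield every (dataset_id, chart_id, task_id) combination.

    Hypothetical helper: the real pipeline renders a Vega-Lite chart and
    generates a task with GPT-4o for each combination; here we only
    enumerate the index triples.
    """
    return product(range(n_datasets), range(n_charts), range(n_tasks))


# Total number of inputs each model is evaluated on.
total_inputs = NUM_SEED_DATASETS * NUM_CHART_TYPES * NUM_TASK_TYPES
print(total_inputs)
```

Under these counts the product is 1,000 × 14 × 10 = 140,000, matching the figure stated in the caption.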
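
The Figure 5 caption notes that an answer of 0.4 is judged correct because it lies within 5% of the target value 0.411. A minimal sketch of such a tolerance check is given below, assuming the tolerance is a relative error measured against the target. In the paper the judgment is made by GPT-4o following a rubric, so this function is an illustrative stand-in, and its name and the zero-target handling are assumptions.

```python
def within_tolerance(predicted, target, rel_tol=0.05):
    """Return True if `predicted` is within `rel_tol` relative error of `target`.

    Assumption: the 5% margin is relative to the target value. A target of
    zero has no meaningful relative error, so only an exact match passes.
    """
    if target == 0:
        return predicted == 0
    return abs(predicted - target) <= rel_tol * abs(target)


# Example from the Figure 5 caption: |0.4 - 0.411| = 0.011, and the allowed
# margin is 0.05 * 0.411 = 0.02055, so the answer is accepted.
accepted = within_tolerance(0.4, 0.411)
print(accepted)
```

A relative margin of this kind tolerates small read-off errors that a human viewer would also make, but as the caption points out, it can also accept an answer derived from the wrong visual element when neighbouring values happen to be close.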