GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

Jonathan Roberts, Kai Han, Samuel Albanie

TL;DR

GRAB is a challenging graph-analysis benchmark for large multimodal models (LMMs), built to restore the headroom that existing evaluations increasingly lack. It comprises $3284$ questions across five tasks and $23$ graph properties; most are synthetic, and the total includes a $1114$-question Real task with hand-drawn, digitally embedded, or noisy figures. A lighter GRAB-Lite contains $500$ questions. The authors evaluate $20$ frontier LMMs and find that the best model attains only $21.0\%$ overall, highlighting substantial gaps in current capabilities. Ablations over task types, prompting, evaluation protocols, and plotting libraries show where models succeed and where they struggle. GRAB and GRAB-Lite are released to drive progress in the graph-analysis capabilities of future LMMs.

Abstract

Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area in which LMMs show potential is graph analysis: specifically, the tasks an analyst might typically perform when interpreting figures, such as estimating the mean, intercepts, or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark fit for current and future frontier LMMs. Our benchmark is predominantly synthetic, ensuring high-quality, noise-free questions. GRAB comprises 3284 questions, covering five tasks and 23 graph properties. We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.0%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB and a lightweight GRAB-Lite to encourage progress in this important, growing domain.

Paper Structure (39 sections, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Overall performance on GRAB. Our benchmark proves challenging for frontier LMMs. The highest performing LMM attains an accuracy of just 21.0% on GRAB.
  • Figure 1: GRAB statistics
  • Figure 2: The GRaph Analysis Benchmark consists of 3284 graph analysis questions that prove challenging for frontier LMMs. The questions cover 23 graph properties organised into five core tasks: (i) Properties focuses on the analysis of features of individual functions and series; (ii) Functions and (iii) Series require computing the mean of properties across multiple functions and series; (iv) Transforms involves determining the properties of a function after it has undergone a series of transforms; and (v) Real encompasses question styles from the other tasks in more realistic formats, including graphs sketched on a whiteboard or paper, embedded in digital contexts, or rendered with added noise.
  • Figure 3: GRAB categories
  • Figure 4: Example questions for the five tasks in the GRAB benchmark. All questions include synthetically rendered graphs apart from the Whiteboard and Paper splits of the Real task, which are hand-drawn and photographed.
  • ...and 3 more figures
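
The Functions task described in the Figure 2 caption (computing the mean of a property across several functions) suggests why synthetic questions can be noise-free: the answer follows exactly from the generating parameters. The sketch below illustrates that idea only. It is not the authors' pipeline; `Line`, `make_functions_question`, `round_half_up`, and `is_correct` are hypothetical names, and the parameter ranges, question wording, one-decimal-place precision, and exact-match scoring are all assumptions made for illustration. Rendering the plot is omitted.

```python
import random
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP
from fractions import Fraction


@dataclass(frozen=True)
class Line:
    """A straight line y = gradient * x + intercept (hypothetical helper)."""
    gradient: int
    intercept: int


def round_half_up(value: Fraction) -> Decimal:
    """Round an exact fraction to 1 decimal place, with halves away from zero."""
    exact = Decimal(value.numerator) / Decimal(value.denominator)
    return exact.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def make_functions_question(n_lines: int, seed: int) -> tuple[list[Line], str, Decimal]:
    """Sample lines with integer parameters and build a mean-gradient question."""
    rng = random.Random(seed)  # seeded so the question is reproducible
    # Assumed parameter range; the real benchmark's ranges are not given here.
    lines = [Line(rng.randint(-5, 5), rng.randint(-5, 5)) for _ in range(n_lines)]
    question = (
        f"What is the mean gradient of the {n_lines} lines shown? "
        "Give your answer to 1 decimal place."
    )
    # Ground truth is exact: it is computed from the parameters, not read off a plot.
    mean_gradient = Fraction(sum(line.gradient for line in lines), n_lines)
    return lines, question, round_half_up(mean_gradient)


def is_correct(prediction: str, answer: Decimal) -> bool:
    """Exact match after parsing the prediction at the required precision."""
    try:
        parsed = Decimal(prediction.strip()).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
    except InvalidOperation:
        return False  # non-numeric output counts as wrong
    return parsed == answer


if __name__ == "__main__":
    lines, question, answer = make_functions_question(n_lines=4, seed=0)
    print(question)
    print("ground truth:", answer)
    print("scored correct:", is_correct(str(answer), answer))
```

Using `Fraction` and `Decimal` rather than floats keeps the ground truth and the rounding rule unambiguous, which matters when scoring is a strict match against a single value.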