GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models

Shangyu Xing, Changhao Xiang, Yuteng Han, Yifan Yue, Zhen Wu, Xinyu Liu, Zhangtai Wu, Fei Zhao, Xinyu Dai

TL;DR

GePBench is a geometry-focused benchmark with 80K figures and 285K questions across six aspects, built to fill a foundational gap in multimodal evaluation: the geometric perception of MLLMs. The dataset is generated by a structured description engine, figure rendering with realistic noise, and template-based question answering (QA). It is used to benchmark 20 models, and the results reveal pronounced gaps between humans and current systems, especially on size and location tasks. A notable contribution is LLaVA-GeP, a model pretrained with GePBench data, which shows transferable gains on downstream tasks such as math and chart interpretation, underscoring geometric perception as a core building block for multimodal understanding. The work also highlights implications for encoder design and shows that augmenting MLLMs with geometric-perception data can yield broad benefits, motivating further research on foundational visual-spatial reasoning.
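
The description-then-template part of the pipeline above can be illustrated with a minimal sketch. This is not the authors' data engine: the `Shape` fields, the shape names, the question wording, and the helper functions are all hypothetical, and the rendering-with-noise stage is omitted. It only shows how a structured description of a figure might be turned into multiple-choice questions on size and location.

```python
import random
from dataclasses import dataclass


@dataclass
class Shape:
    """Hypothetical structured description of one shape in a figure."""
    name: str    # e.g. "circle"
    cx: float    # centre x, normalised to [0, 1]
    cy: float    # centre y, normalised to [0, 1]
    size: float  # characteristic size, normalised


def describe(rng: random.Random, n_shapes: int = 2) -> list[Shape]:
    """Sample a structured description with distinct shape names (assumed set)."""
    names = rng.sample(["circle", "square", "triangle", "ellipse"], n_shapes)
    return [
        Shape(n, rng.random(), rng.random(), rng.uniform(0.1, 0.4))
        for n in names
    ]


def size_question(shapes: list[Shape], rng: random.Random) -> dict:
    """Fill a size template: the answer is derived from the description."""
    a, b = rng.sample(shapes, 2)
    answer = a.name if a.size >= b.size else b.name
    options = [a.name, b.name]
    rng.shuffle(options)  # avoid a fixed answer position
    return {
        "question": "Which of the two shapes is larger?",
        "options": options,
        "answer": answer,
    }


def location_question(shapes: list[Shape], rng: random.Random) -> dict:
    """Fill a location template: which shape lies further to the left."""
    a, b = rng.sample(shapes, 2)
    answer = a.name if a.cx <= b.cx else b.name
    options = [a.name, b.name]
    rng.shuffle(options)
    return {
        "question": "Which of the two shapes is further to the left?",
        "options": options,
        "answer": answer,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    figure = describe(rng)
    print(size_question(figure, rng))
    print(location_question(figure, rng))
```

Because each answer is computed from the description rather than from the rendered image, the labels stay correct whatever noise a rendering step adds.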

Abstract

Multimodal large language models (MLLMs) have made significant progress in integrating visual and linguistic understanding. Existing benchmarks typically focus on high-level semantic capabilities, such as scene understanding and visual reasoning, but often overlook a crucial, foundational ability: geometric perception. Geometric perception involves understanding geometric shapes, structures, and spatial relationships, which are essential for supporting higher-level semantic tasks. Despite its importance, this capability remains underexplored in current MLLM research. To address this gap, we introduce GePBench, a novel benchmark designed to assess the geometric perception abilities of MLLMs. Our extensive evaluations reveal that current state-of-the-art MLLMs exhibit significant deficiencies in geometric perception tasks. Furthermore, we show that models trained with GePBench data demonstrate substantial improvements on a wide range of benchmark tasks, highlighting the critical role of geometric perception in enabling advanced multimodal applications. Our code and datasets will be publicly available.

Paper Structure (51 sections, 10 figures, 4 tables, 1 algorithm)

Figures (10)

  • Figure 1: Examples for the different aspects of GePBench.
  • Figure 2: An overview of the data engine of GePBench.
  • Figure 3: Key data distributions of GePBench.
  • Figure 4: Performance comparison of representative models on GePBench and OpenCompass. Larger dots indicate larger model sizes. Passing threshold denotes 60.0% accuracy.
  • Figure 5: Comparison of the average accuracy of representative models on questions categorized by different number of geometric shapes in the figure.
  • ...and 5 more figures