GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models

Shangyu Xing; Changhao Xiang; Yuteng Han; Yifan Yue; Zhen Wu; Xinyu Liu; Zhangtai Wu; Fei Zhao; Xinyu Dai

TL;DR

GePBench is a geometry-focused benchmark of 80K figures and 285K questions across six aspects, built to evaluate geometric perception in MLLMs. The dataset is produced by a data engine that generates structured descriptions, renders figures with realistic noise, and creates question-answer pairs from templates. Benchmarking 20 models on it reveals pronounced gaps between humans and current systems, especially on size and location tasks. A notable contribution is LLaVA-GeP, a model pretrained with GePBench data, which shows transferable gains on downstream tasks such as math and chart interpretation, underscoring geometric perception as a core building block for multimodal understanding. The work also highlights implications for encoder design and shows that augmenting MLLMs with geometric-perception data can yield broad benefits, motivating further research on foundational visual-spatial reasoning.
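
To make the data-engine idea above concrete, here is a minimal sketch of how a structured description could drive template-based question generation for the size and location aspects. It is an illustration under stated assumptions, not the authors' implementation: the `Shape` fields, the function names, the question templates, and the unit-canvas coordinates are all hypothetical, and the figure-rendering and noise step is omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class Shape:
    # Hypothetical structured description of one shape (not the paper's schema).
    kind: str    # e.g. "circle", "square", "triangle"
    cx: float    # center x on a unit canvas
    cy: float    # center y on a unit canvas
    size: float  # half-width of the bounding box


def generate_description(rng: random.Random, n_shapes: int = 2) -> list[Shape]:
    """Sample a structured description with distinct shape kinds."""
    kinds = ["circle", "square", "triangle"]
    chosen = rng.sample(kinds, n_shapes)  # distinct kinds keep questions unambiguous
    return [
        Shape(k, rng.uniform(0.2, 0.8), rng.uniform(0.2, 0.8), rng.uniform(0.05, 0.15))
        for k in chosen
    ]


def make_location_question(shapes: list[Shape]) -> dict:
    """Fill a location template; the answer is read off the description."""
    a, b = shapes[0], shapes[1]
    return {
        "question": f"Is the {a.kind} to the left of the {b.kind}?",
        "options": ["yes", "no"],
        "answer": "yes" if a.cx < b.cx else "no",
    }


def make_size_question(shapes: list[Shape]) -> dict:
    """Fill a size template; the answer is the shape with the largest size."""
    largest = max(shapes, key=lambda s: s.size)
    return {
        "question": "Which shape is the largest?",
        "options": [s.kind for s in shapes],
        "answer": largest.kind,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    description = generate_description(rng)
    print(make_location_question(description))
    print(make_size_question(description))
```

Because the answers are derived from the structured description rather than from the rendered image, every generated question has a ground-truth label by construction, which is the main appeal of a template-based design.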

Abstract

Multimodal large language models (MLLMs) have made significant progress in integrating visual and linguistic understanding. Existing benchmarks typically focus on high-level semantic capabilities, such as scene understanding and visual reasoning, but often overlook a crucial, foundational ability: geometric perception. Geometric perception involves understanding geometric shapes, structures, and spatial relationships, which are essential for supporting higher-level semantic tasks. Despite its importance, this capability remains underexplored in current MLLM research. To address this gap, we introduce GePBench, a novel benchmark designed to assess the geometric perception abilities of MLLMs. Our extensive evaluations reveal that current state-of-the-art MLLMs exhibit significant deficiencies in geometric perception tasks. Furthermore, we show that models trained with GePBench data demonstrate substantial improvements on a wide range of benchmark tasks, highlighting the critical role of geometric perception in enabling advanced multimodal applications. Our code and datasets will be publicly available.

Paper Structure (51 sections, 10 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Multimodal Large Language Models
Multimodal Benchmarks
GePBench
Structured Description Generation
Figure Rendering
Question-Answer Generation
Statistics and Analysis
Experiments
Experimental Setup
Evaluated models.
Evaluation setup.
Main Result
Both closed-source and open-source models face considerable challenges on GePBench.
...and 36 more sections

Figures (10)

Figure 1: Examples for the different aspects of GePBench.
Figure 2: An overview of the data engine of GePBench.
Figure 3: Key data distributions of GePBench.
Figure 4: Performance comparison of representative models on GePBench and OpenCompass. Larger dots indicate larger model sizes. Passing threshold denotes 60.0% accuracy.
Figure 5: Comparison of the average accuracy of representative models on questions categorized by different number of geometric shapes in the figure.
...and 5 more figures
