GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models
Shangyu Xing, Changhao Xiang, Yuteng Han, Yifan Yue, Zhen Wu, Xinyu Liu, Zhangtai Wu, Fei Zhao, Xinyu Dai
TL;DR
GePBench addresses a foundational gap in multimodal perception with a geometry-focused benchmark of 80K figures and 285K questions across six aspects, enabling rigorous evaluation of geometric perception in MLLMs. The dataset is generated by a structured description engine, figure rendering with realistic noise, and template-based QA. It is used to benchmark 20 models, revealing pronounced gaps between humans and current systems, especially on size and location tasks. A notable contribution is LLaVA-GeP, a model pretrained with GePBench data, which shows transferable gains on downstream tasks such as math and chart interpretation, underscoring the importance of geometric perception as a core building block for multimodal understanding. The work also highlights implications for encoder design and shows that augmenting MLLMs with geometric-perception data can yield broad benefits, motivating further research on foundational visual-spatial reasoning.
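To make the generation pipeline more concrete, the following is a minimal sketch of how template-based QA might be derived from a structured figure description. It is not the authors' implementation: the description schema (`kind`, `x`, `y`, `size`), the function names, and the question templates are all assumptions made for illustration, and the rendering-with-noise step is omitted.

```python
import random

# Hypothetical schema: each shape is a dict with a kind, a centre (x, y) in
# normalised image coordinates, and a size. None of these names come from
# the GePBench code; they are illustrative assumptions.
KINDS = ["circle", "square", "triangle"]
SIZES = [0.10, 0.15, 0.20, 0.25, 0.30]
X_POSITIONS = [0.2, 0.4, 0.6, 0.8]


def make_description(rng, n_shapes=2):
    """Sample a structured description with distinct kinds, sizes and x values."""
    kinds = rng.sample(KINDS, n_shapes)
    sizes = rng.sample(SIZES, n_shapes)      # distinct, so "larger" is unambiguous
    xs = rng.sample(X_POSITIONS, n_shapes)   # distinct, so "left/right" is unambiguous
    return [
        {"kind": k, "x": x, "y": rng.choice(X_POSITIONS), "size": s}
        for k, x, s in zip(kinds, xs, sizes)
    ]


def size_question(desc):
    """Fill a size-comparison template; the answer is read from the description."""
    a, b = desc[0], desc[1]
    return {
        "aspect": "size",
        "question": f"Which shape is larger, the {a['kind']} or the {b['kind']}?",
        "options": [a["kind"], b["kind"]],
        "answer": a["kind"] if a["size"] > b["size"] else b["kind"],
    }


def location_question(desc):
    """Fill a relative-location template based on the shapes' x coordinates."""
    a, b = desc[0], desc[1]
    return {
        "aspect": "location",
        "question": f"Is the {a['kind']} to the left or to the right of the {b['kind']}?",
        "options": ["left", "right"],
        "answer": "left" if a["x"] < b["x"] else "right",
    }


if __name__ == "__main__":
    description = make_description(random.Random(0))
    for qa in (size_question(description), location_question(description)):
        print(qa["question"], "->", qa["answer"])
```

Because each answer is computed from the same description that would drive rendering, the ground truth stays consistent with the figure by construction, which is the main appeal of a template-based design.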
Abstract
Multimodal large language models (MLLMs) have made significant progress in integrating visual and linguistic understanding. Existing benchmarks typically focus on high-level semantic capabilities, such as scene understanding and visual reasoning, but often overlook a crucial, foundational ability: geometric perception. Geometric perception involves understanding geometric shapes, structures, and spatial relationships, which are essential for supporting higher-level semantic tasks. Despite its importance, this capability remains underexplored in current MLLM research. To address this gap, we introduce GePBench, a novel benchmark designed to assess the geometric perception abilities of MLLMs. Our extensive evaluations reveal that current state-of-the-art MLLMs exhibit significant deficiencies in geometric perception tasks. Furthermore, we show that models trained with GePBench data demonstrate substantial improvements on a wide range of benchmark tasks, highlighting the critical role of geometric perception in enabling advanced multimodal applications. Our code and datasets will be publicly available.
