GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving

Jiaxin Zhang, Zhongzhi Li, Mingliang Zhang, Fei Yin, Chenglin Liu, Yashar Moshfeghi

TL;DR

GeoEval is a benchmark for evaluating geometry problem-solving in large language models (LLMs) and multi-modal models (MMs). It combines a 2,000-problem main set with backward, augmented, and hard subsets to probe reasoning, generalization, and robustness. The study shows that math-pretrained models (e.g., WizardMath) excel on regular geometry problems but struggle with harder, novel items, while GPT-series models gain from rephrasing and diagram descriptions. A zero-shot evaluation with a post-processing answer-extraction pipeline reveals that multimodal models, especially those with diagram handling (e.g., GPT-4V, mPLUG-Owl2), can outperform text-only baselines on diagram-containing tasks, though performance still degrades as question length and complexity increase. The results highlight the value of domain-specific pre-training, diagram descriptions, and data-generation strategies (backward reasoning, augmentation) for advancing geometry reasoning in AI, while pointing to future work on richer reasoning explanations and broader external-knowledge integration.

Abstract

Recent advancements in large language models (LLMs) and multi-modal models (MMs) have demonstrated their remarkable capabilities in problem-solving. Yet their proficiency in tackling geometry math problems, which requires an integrated understanding of both textual and visual information, has not been thoroughly evaluated. To address this gap, we introduce the GeoEval benchmark, a comprehensive collection that includes a main subset of 2,000 problems, a 750-problem subset focusing on backward reasoning, an augmented subset of 2,000 problems, and a hard subset of 300 problems. This benchmark facilitates a deeper investigation into the performance of LLMs and MMs in solving geometry math problems. Our evaluation of ten LLMs and MMs across these varied subsets reveals that the WizardMath model excels, achieving 55.67% accuracy on the main subset but only 6.00% accuracy on the hard subset. This highlights the critical need for testing models against datasets on which they have not been pre-trained. Additionally, our findings indicate that GPT-series models perform more effectively on problems they have rephrased, suggesting a promising method for enhancing model capabilities.
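
To make the accuracy figures above concrete, here is a minimal sketch of how a zero-shot evaluation with post-processing answer extraction might be scored. It is not the authors' pipeline: the helper names `extract_answer` and `accuracy`, the "take the last number in the response" rule, and the numeric tolerance are all assumptions made for illustration.

```python
import re

# Assumed extraction rule: treat the last number in a free-form
# response as the model's final answer. This is a hypothetical
# heuristic, not the extraction method used in the paper.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_answer(response):
    """Return the last number found in the response, or None."""
    # Drop thousands separators so "1,000" is read as 1000.
    matches = _NUMBER.findall(response.replace(",", ""))
    return float(matches[-1]) if matches else None


def accuracy(responses, gold, tol=1e-2):
    """Percentage of responses whose extracted answer matches the gold value."""
    correct = 0
    for response, target in zip(responses, gold):
        predicted = extract_answer(response)
        # A response with no extractable number counts as wrong.
        if predicted is not None and abs(predicted - target) <= tol:
            correct += 1
    return 100.0 * correct / len(gold)


if __name__ == "__main__":
    # Toy responses invented for this example; not benchmark data.
    responses = [
        "The area is 12.5 square units.",
        "So x = 30 degrees, therefore the answer is 60",
        "I cannot determine the answer.",
    ]
    gold = [12.5, 60.0, 45.0]
    print(f"accuracy: {accuracy(responses, gold):.2f}%")
```

Scoring each subset separately with a function like this is what yields per-subset numbers such as those reported for the main and hard subsets. A real pipeline would also need to handle multiple-choice options, units, and symbolic answers, which this sketch ignores.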

Paper Structure (46 sections, 13 figures, 11 tables)

Figures (13)

  • Figure 1: Examples of the GeoEval benchmark.
  • Figure 2: Detailed accuracy scores for models across various academic subjects.
  • Figure 3: Comparison of models requiring external constants ("w", in blue) and those that do not ("w/o", in orange).
  • Figure 4: Model performance on the GeoEval-2000 subset by question length.
  • Figure 5: Model performance on the GeoEval-2000 subset by complexity level.
  • ...and 8 more figures