Beyond Lines and Circles: Unveiling the Geometric Reasoning Gap in Large Language Models

Spyridon Mouselinos; Henryk Michalewski; Mateusz Malinowski

TL;DR

This work investigates the geometric reasoning gap in Large Language Models (LLMs) by analyzing constructive geometry tasks, identifying biases in how the target variable is selected and named as well as deficiencies in spatial reasoning. It introduces a simulacra-based multi-agent framework with adaptive prompting, a Visual Relations Prompt (VRP), and target-name renaming to strengthen spatial reasoning and tool use. In extensive experiments on Euclidea and in cross-dataset tests, the approach yields substantial gains over single-agent baselines, with notable improvements for open models and strong generalization to other math datasets. The study points to multi-agent, role-specialized LLM systems that integrate planning, symbol manipulation, and spatial understanding as a promising design direction, while acknowledging their cost and limits on generalization.
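Target-name renaming can be pictured as a simple rewrite of the task text before it reaches the model. The sketch below is a hypothetical helper, not the authors' code: it assumes point labels are single uppercase letters, that segments and triangles are written as runs of labels (for example "AC" or "ABC"), and that "X" is an acceptable neutral replacement label (one of the names compared in Figure 3).

```python
import re


def rename_target(task: str, old: str, new: str = "X") -> str:
    """Rename the target point label in a construction task description.

    Hypothetical helper. Assumes point labels are single uppercase letters and
    that composite objects are written as runs of labels ("AC", "ABC"), so a
    label is any occurrence of the letter not adjacent to a lowercase letter.
    """
    def label_pattern(letter: str) -> str:
        # Match the letter only when it is not part of an ordinary word
        # such as "Construct" or "Circle".
        return rf"(?<![a-z]){re.escape(letter)}(?![a-z])"

    # Refuse to rename if the new label already names another object.
    if re.search(label_pattern(new), task):
        raise ValueError(f"label {new!r} is already used in the task")
    return re.sub(label_pattern(old), new, task)


task = "Given segment AB, construct point C on AB such that AC = CB."
print(rename_target(task, "C"))
# Given segment AB, construct point X on AB such that AX = XB.
```

Renaming inside composite labels such as "AC" keeps the rewritten task internally consistent; a plain word-boundary match would rename "point C" but leave "AC" untouched.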

Abstract

Large Language Models (LLMs) demonstrate ever-increasing abilities in mathematical and algorithmic tasks, yet their geometric reasoning skills remain underexplored. We investigate LLMs' abilities in constructive geometric problem-solving, one of the most fundamental steps in the development of human mathematical reasoning. Our work reveals notable challenges that state-of-the-art LLMs face in this domain despite their many successes in similar areas. LLMs exhibit biases in target variable selection and struggle with 2D spatial relationships, often misrepresenting and hallucinating objects and their placements. To address these challenges, we introduce a framework that builds an LLM-based multi-agent system, which enhances the models' existing reasoning potential by conducting an internal dialogue. This work underscores LLMs' current limitations in geometric reasoning and improves their geometric reasoning capabilities through self-correction, collaboration, and diverse role specializations.
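The internal dialogue among role-specialized agents can be sketched as a round-robin loop over a shared transcript. This is a minimal illustration under stated assumptions, not the authors' protocol: the role names, the turn order, the `FINAL:` stopping convention and the `circle(A, B)` answer notation are all hypothetical, and each agent is a plain Python callable standing in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Each agent maps (its role, the shared transcript) to a reply. In a real
# system this callable would wrap an LLM call; here it is a plain function.
AgentFn = Callable[[str, List[str]], str]


@dataclass
class Agent:
    role: str  # hypothetical role label, e.g. "planner" or "critic"
    respond: AgentFn


def discuss(agents: List[Agent], task: str, max_rounds: int = 3,
            stop_token: str = "FINAL:") -> Optional[str]:
    """Run a round-robin discussion over a shared transcript.

    Every agent sees the task and all previous messages. The first reply that
    starts with `stop_token` ends the discussion and its remainder is returned
    as the proposed construction. Returns None if no agent commits to an
    answer within `max_rounds` rounds.
    """
    transcript: List[str] = [f"TASK: {task}"]
    for _ in range(max_rounds):
        for agent in agents:
            reply = agent.respond(agent.role, transcript)
            transcript.append(f"{agent.role}: {reply}")
            if reply.startswith(stop_token):
                return reply[len(stop_token):].strip()
    return None


# Stub agents standing in for LLM calls (hypothetical behaviour).
def commenter(role: str, transcript: List[str]) -> str:
    return f"{role} reviewed {len(transcript)} messages"


def solver(role: str, transcript: List[str]) -> str:
    # Commit only after the three other roles have spoken once.
    if len(transcript) >= 4:
        return "FINAL: circle(A, B); circle(B, A); C = intersection"
    return "need more input"


agents = [Agent("planner", commenter), Agent("geometer", commenter),
          Agent("critic", commenter), Agent("solver", solver)]
print(discuss(agents, "Construct an equilateral triangle on segment AB"))
# circle(A, B); circle(B, A); C = intersection
```

Keeping a single shared transcript lets each role react to, and correct, what the others said, which is the self-correction and collaboration the abstract refers to; a real system would replace the stubs with prompted LLM calls.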

Paper Structure

This paper contains 26 sections, 4 figures, and 11 tables.

Introduction
Related Work
Preliminaries
Method
Prompt Engineering for Geometric Reasoning
From Single Models To Simulacra
Enhancing Spatial Awareness with the Visual Relations Prompt
Mitigating Naming Biases
Experiments
Ablation Studies
Overcoming hallucinations and context overdependence
Effectiveness of domain and role division
Visual Aids in Spatial Reasoning
Impact of Geometry Nomenclature on LLMs
Generalisation to different datasets
...and 11 more sections

Figures (4)

Figure 1: Drawing inspiration from the Ancient Greek Academy, we divide the reasoning pipeline into three stages. From left to right: the current geometric construction task is broken down into the image, its task description, and the available tools; our framework employs four LLM-based agents, each prompted with a specific role and task; and a collaborative multi-round discussion is conducted in which the geometric construction is solved, reflecting the Academy's collective approach to problem-solving and reasoning.
Figure 2: The adaptive few-shot mechanism. When facing the problem "Construct a 30-degree angle given a ray", we filter our knowledge base (top right) and then either return the five most similar results to build the prompt (Adaptive-Shot (ST)) or prompt the LLM to filter them by itself (Adaptive-Shot (Self)).
Figure 3: GPT-4 reasoning paths visualized for different names of the target variable. Top left: $Target=C$; top right: $Target=D$; bottom left: $Target=E$; bottom right: $Target=X$.
Figure 4: Visual Relations Prompt (VRP) extraction using GPT-4V.
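The Adaptive-Shot retrieval in Figure 2 (return the five most similar solved tasks and build a few-shot prompt from them) can be sketched as follows. The similarity measure is a stand-in: a dependency-free bag-of-words cosine similarity rather than whatever model the authors used, and the knowledge-base entries and prompt layout are hypothetical placeholders.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def bag_of_words(text: str) -> Counter:
    # Lowercased word/number tokens; "30-degree" -> ["30", "degree"].
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k_examples(problem: str, knowledge_base: Dict[str, str],
                   k: int = 5) -> List[Tuple[str, str]]:
    """Return the k (task, solution) pairs whose task text is most similar to `problem`."""
    query = bag_of_words(problem)
    ranked = sorted(knowledge_base.items(),
                    key=lambda item: cosine(query, bag_of_words(item[0])),
                    reverse=True)
    return ranked[:k]


def build_prompt(problem: str, examples: List[Tuple[str, str]]) -> str:
    # Solved examples first, then the open problem for the model to complete.
    shots = "\n\n".join(f"Task: {t}\nSolution: {s}" for t, s in examples)
    return f"{shots}\n\nTask: {problem}\nSolution:"


kb = {  # toy knowledge base with placeholder solutions
    "Construct a 60-degree angle given a ray": "<placeholder>",
    "Construct the midpoint of a segment": "<placeholder>",
    "Bisect an angle": "<placeholder>",
}
shots = top_k_examples("Construct a 30-degree angle given a ray", kb, k=2)
print([task for task, _ in shots])
# ['Construct a 60-degree angle given a ray', 'Construct the midpoint of a segment']
```

In an evaluation setting the query problem itself should be removed from the knowledge base before ranking. The Adaptive-Shot (Self) variant would replace the `sorted` ranking with a step in which the LLM is shown the candidates and asked to choose among them.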