Plane Geometry Problem Solving with Multi-modal Reasoning: A Survey

Seunghyuk Cho, Zhenyue Qin, Yang Liu, Youngbin Choi, Seungbeom Lee, Dongwoo Kim

TL;DR

This survey consolidates recent advances in plane geometry problem solving (PGPS) by organizing approaches around an encoder-decoder framework and detailing how inputs (diagrams and text) are transformed into intermediate representations and final outputs. It divides encoder outputs into formal-language descriptions and embedding vectors, and decoder outputs into theorem sequences, logic programs, or natural language, covering both rule-based and neural strategies. The paper also discusses two key challenges, hallucination in diagram perception and data leakage in benchmarks, and outlines future directions: more robust perception, better benchmark design, and standardized evaluation. Collectively, the work clarifies existing methodological patterns and data-collection practices, guiding future PGPS research toward scalable, reliable multi-modal geometric reasoning with practical impact on tutoring and automated proof systems.

Abstract

Plane geometry problem solving (PGPS) has recently gained significant attention as a benchmark to assess the multi-modal reasoning capabilities of large vision-language models. Despite the growing interest in PGPS, the research community still lacks a comprehensive overview that systematically synthesizes recent work in PGPS. To fill this gap, we present a survey of existing PGPS studies. We first categorize PGPS methods into an encoder-decoder framework and summarize the corresponding output formats used by their encoders and decoders. Subsequently, we classify and analyze these encoders and decoders according to their architectural designs. Finally, we outline major challenges and promising directions for future research. In particular, we discuss the hallucination issues arising during the encoding phase within encoder-decoder architectures, as well as the problem of data leakage in current PGPS benchmarks.

Paper Structure

This paper contains 53 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustration of three PGPS tasks. The three tasks are commonly used to evaluate PGPS methods in existing benchmarks: i) direct-answer, ii) multiple-choice, and iii) reasoning-step construction. In the direct-answer task, the model must predict a single numerical value as the answer to the problem. In the multiple-choice task, the model must select the correct label corresponding to the ground-truth option. In the reasoning-step construction task, the model is asked to generate the complete sequence of reasoning steps that lead to the correct final answer.
  • Figure 2: Visualization of the overall structure of PGPS methods. PGPS methods first encode the input diagram and text into an intermediate representation. The encoded representation is then passed to the decoder, which generates the final solution as a theorem sequence, a logic program, or a natural-language description.
  • Figure 3: Overview of the PGPS pipeline. PGPS methods can be categorized based on the combination of the encoder, intermediate representation, decoder, and output representation. For example, InterGPS can be represented as a combination of E2, I1, D3, and O1. We summarize PGPS methods as combinations of these components in Table \ref{tab:pgps_methods}.
  • Figure A1: Error analysis of GPT-4V's responses on MathVerse. We report the average percentage of each error type across the five MathVerse variants (Text Dominant, Text Lite, Vision Intensive, Vision Dominant, and Vision Only), as reported in MathVerse. Our analysis indicates that incorrect answers predominantly result from visual perception and reasoning errors.
  • Figure A2: Visualization of synthetic and real-world geometric diagrams. We compare synthetically generated geometric diagrams with diagrams manually collected from existing sources. The synthetic diagrams are from GeomVerse, VisOnlyQA, MAVIS, and GeoDANO. The real-world diagrams are from MathVerse.
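
The component-based categorization described in the Figure 3 caption can be illustrated with a small data structure. The following is a minimal sketch, not code from the survey: the class name PGPSMethod, its field names, and the helper group_by are hypothetical, and the component codes (E2, I1, D3, O1) are treated as opaque labels because their definitions are not given in this excerpt. Only the InterGPS entry is taken from the caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PGPSMethod:
    """A PGPS method described by its four pipeline components (hypothetical schema)."""
    name: str
    encoder: str        # encoder type label, e.g. "E2"
    intermediate: str   # intermediate representation label, e.g. "I1"
    decoder: str        # decoder type label, e.g. "D3"
    output: str         # output representation label, e.g. "O1"

    def signature(self) -> str:
        """Join the four component labels into one compact string."""
        return "-".join((self.encoder, self.intermediate, self.decoder, self.output))


# The only combination stated in the Figure 3 caption.
INTER_GPS = PGPSMethod("InterGPS", "E2", "I1", "D3", "O1")


def group_by(methods, field):
    """Group method names by one component field (e.g. "output")."""
    groups = {}
    for method in methods:
        groups.setdefault(getattr(method, field), []).append(method.name)
    return groups


print(INTER_GPS.signature())
print(group_by([INTER_GPS], "output"))
```

Grouping by a single field, such as the output representation, gives one way to read off which methods share a component, which is the kind of comparison the methods table supports.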