GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models

Aditya Sharma, Aman Dalmia, Mehran Kazemi, Amal Zouaq, Christopher J. Pal

TL;DR

GeoCoder leverages modular code-finetuning to generate and execute code using a predefined geometry function library. A multimodal retrieval-augmented variant, RAG-GeoCoder, incorporates a non-parametric memory module for retrieving functions from the geometry library, thereby reducing reliance on parametric memory.

Abstract

Geometry problem-solving demands advanced reasoning abilities to process multimodal inputs and employ mathematical knowledge effectively. Vision-language models (VLMs) have made significant progress in various multimodal tasks. Yet, they still struggle with geometry problems and are significantly limited by their inability to perform mathematical operations not seen during pre-training, such as calculating the cosine of an arbitrary angle, and by difficulties in correctly applying relevant geometry formulas. To overcome these challenges, we present GeoCoder, which leverages modular code-finetuning to generate and execute code using a predefined geometry function library. By executing the code, we achieve accurate and deterministic calculations, in contrast to the stochastic nature of autoregressive token prediction, while the function library minimizes errors in formula usage. We also propose a multimodal retrieval-augmented variant of GeoCoder, named RAG-GeoCoder, which incorporates a non-parametric memory module for retrieving functions from the geometry library, thereby reducing reliance on parametric memory. Our modular code-finetuning approach enhances the geometric reasoning capabilities of VLMs, yielding an average improvement of over 16% across various question complexities on the GeomVerse dataset compared to other finetuning methods.
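
To make the idea of executing modular code against a function library concrete, here is a minimal Python sketch. It is not the authors' library or pipeline: only the function name third_angle_of_triangle appears in the source (Figure 5 caption). The helper side_from_law_of_cosines, the FUNCTION_LIBRARY dictionary, the GENERATED_PROGRAM string (a stand-in for model output), and run_generated_program are all hypothetical names introduced for illustration.

```python
import math


def third_angle_of_triangle(angle_a, angle_b):
    """Return the third interior angle of a triangle, in degrees."""
    return 180.0 - angle_a - angle_b


def side_from_law_of_cosines(side_a, side_b, included_angle_deg):
    """Hypothetical helper: side opposite the included angle (law of cosines)."""
    gamma = math.radians(included_angle_deg)
    return math.sqrt(side_a**2 + side_b**2 - 2 * side_a * side_b * math.cos(gamma))


# Assumed layout of a function library: a plain name-to-callable mapping.
FUNCTION_LIBRARY = {
    "third_angle_of_triangle": third_angle_of_triangle,
    "side_from_law_of_cosines": side_from_law_of_cosines,
}

# Stand-in for the code a finetuned model might emit for one problem.
GENERATED_PROGRAM = """
angle_c = third_angle_of_triangle(50, 70)
answer = side_from_law_of_cosines(3, 4, angle_c)
"""


def run_generated_program(program, library):
    """Execute generated code with only the library functions in scope.

    Emptying __builtins__ keeps the example small; it is not a secure sandbox.
    """
    namespace = {"__builtins__": {}, **library}
    exec(program, namespace)
    return namespace["answer"]


if __name__ == "__main__":
    print(run_generated_program(GENERATED_PROGRAM, FUNCTION_LIBRARY))
```

In this sketch the cosine is computed by the interpreter rather than predicted token by token, and each formula lives in one named function, which is the property the abstract attributes to code execution with a function library.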

Paper Structure

This paper contains 29 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Sample geometry problem from the GeomVerse dataset. Question: If the ABCD shape is a rectangle where a semi-circle has been removed from one side of it, the area of the BAF right triangle is 40 and the area of the BGHF parallelogram is 102, compute the area of the ABCD shape.
  • Figure 2: The first step in our methodology consists of generating modular code by employing few-shot prompting with a code generation-capable LLM, utilizing questions, TikZ image illustrations, CoT reasoning, and the predefined function library. The generations that execute to produce the correct answer are selected as the basis for our "gold" code-tuning data, as discussed in Section \ref{subsec:geo_data_gen}.
  • Figure 3: During modular code-finetuning, we utilize the code-tuning data produced by our teacher LLM (see Section \ref{subsec:geo_data_gen}) to finetune a significantly smaller VLM, which we refer to as GeoCoder (as discussed in Section \ref{subsec:geo_model}).
  • Figure 4: For each geometry problem, given the image and question text, our multimodal retriever retrieves the most similar functions from the function memory, as discussed in Section \ref{subsec:retriever}.
  • Figure 5: Modular functions add interpretability, as discussed in Section \ref{subsec:interpretability}. In this example, the underlined values are filled in the template by the 'third_angle_of_triangle' function.
  • ...and 3 more figures
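
The retrieval step (Figure 4) and the templated explanation (Figure 5) can be illustrated with a small Python sketch. This is a deliberately simplified stand-in, not the paper's method: the retriever described in the source is multimodal, while the one below is text-only and ranks functions by bag-of-words cosine similarity. The FUNCTION_MEMORY entries, their descriptions, the TEMPLATE wording, and every helper name other than third_angle_of_triangle are assumptions made for illustration.

```python
import math
from collections import Counter

# Assumed function memory: function name -> short natural-language description.
FUNCTION_MEMORY = {
    "third_angle_of_triangle": "compute the third angle of a triangle from two known angles",
    "area_of_rectangle": "compute the area of a rectangle from its two side lengths",
    "area_of_semicircle": "compute the area of a semi-circle from its diameter",
}


def bag_of_words(text):
    """Lowercase the text, drop commas and periods, and count tokens."""
    return Counter(text.lower().replace(",", " ").replace(".", " ").split())


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_functions(question, memory, top_k=2):
    """Return the names of the top_k functions most similar to the question text."""
    query = bag_of_words(question)
    ranked = sorted(
        memory,
        key=lambda name: cosine_similarity(query, bag_of_words(memory[name])),
        reverse=True,
    )
    return ranked[:top_k]


def third_angle_of_triangle(angle_a, angle_b):
    """Return the third interior angle of a triangle, in degrees."""
    return 180.0 - angle_a - angle_b


# Hypothetical explanation template; the function call supplies the final value.
TEMPLATE = "The angles {a} and {b} are known, so the third angle is 180 - {a} - {b} = {c}."

if __name__ == "__main__":
    question = "The two known angles of the triangle are 50 and 70, compute the third angle."
    print(retrieve_functions(question, FUNCTION_MEMORY))
    print(TEMPLATE.format(a=50, b=70, c=third_angle_of_triangle(50, 70)))
```

Keeping the functions in an external memory and looking them up per problem is what lets a retrieval-augmented variant rely less on what the model has memorized; a real system would replace the token-overlap score with learned image and text embeddings.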