MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems

Bin Lei, Yi Zhang, Shan Zuo, Ali Payani, Caiwen Ding

TL;DR

This paper addresses the challenge of guiding large language models to perform complex, multi-step mathematical reasoning. It introduces MACM, a generalizable prompting framework that abstracts problems into Conditions and an Objective, and employs a three-agent loop—Thinker, Judge, Executor—to iteratively mine new conditions and compute solutions without problem-specific prompts. Across MATH Level-5 problems, the 24-point game, and sequence sorting, MACM yields substantial accuracy gains and improved error correction compared to CoT, SC-CoT, ToT, and GoT, demonstrating strong generalizability. While the approach increases inference time due to multiple LLM invocations and shows geometry-specific limitations, it provides a scalable blueprint for enhancing mathematical reasoning in LLMs and offers avenues for dataset-driven refinement of model cognition.

Abstract

Recent advancements in large language models, such as GPT-4, have demonstrated remarkable capabilities in processing standard queries. Despite these advancements, their performance declines substantially on advanced mathematical problems requiring complex, multi-step logical reasoning. To enhance their inferential capabilities, current research has delved into prompt engineering, exemplified by methodologies such as the Tree of Thought and Graph of Thought. Nonetheless, these existing approaches encounter two significant limitations. First, their effectiveness in tackling complex mathematical problems is somewhat constrained. Second, the necessity to design distinct prompts for individual problems hampers their generalizability. In response to these limitations, this paper introduces the Multi-Agent System for Condition Mining (MACM) prompting method. It not only resolves intricate mathematical problems but also demonstrates strong generalization capabilities across various mathematical contexts. With the assistance of MACM, the accuracy of GPT-4 Turbo on the most challenging level five mathematical problems in the MATH dataset increases from 54.68% to 76.73%. The code is available at https://github.com/bin123apple/MACM.

Paper Structure (13 sections, 6 figures, 3 tables)

This paper contains 13 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The Comparison Between the Current Mainstream Prompting Methods and MACM. MACM extracts conditions and the objective from each math problem, iteratively adds new insights to the known conditions, and repeats this until enough information is gathered to reach a solution.
  • Figure 2: The overall structure of MACM. Legend entries: original math problem; condition list; true; false; discard; known conditions; new conditions; objective; Thinker; Judge; Executor. Steps: ➀ initialize the condition list and the objective; ➁ explore new conditions based on the current condition list; ➂ check whether the new condition is correct; ➄ check whether the objective can be achieved based on the conditions currently in the condition list; ➅ design steps for achieving the objective based on the current conditions; ➆ achieve the objective by following the designed steps.
  • Figure 3: MACM's detailed analysis process for complex mathematical problems with specific prompts, illustrated with an algebra problem (on the left) and a geometry problem (on the right). We use one set of prompts that can target different types of problems, with prompts 0-6 displayed below the dialogue box. In these examples, MACM involves three steps: 1. Extracting the conditions and the objective. 2. Iteratively identifying new conditions. 3. Solving the problem based on the known conditions.
  • Figure 4: The differences in responses to various questions between an LLM with a defined identity (such as Thinker) and an LLM without a defined identity (like the original LLM). The Thinker consistently provides responses in the same format, while the original LLM produces responses in varying formats. Legend entries: user; original LLM; Thinker.
  • Figure 5: The performance comparison of GPT-4 Turbo with and without MACM on Level 5 problems of the MATH dataset.
  • ...and 1 more figure
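
The control flow described in the Figure 2 caption (initialize conditions and objective, mine and check new conditions, test whether the objective is reachable, then plan and execute) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function `macm_loop`, the four callables, the `max_rounds` cap, and the toy agents are all hypothetical names introduced here, and the assignment of each step to a particular callable is an assumption. In a real setting each callable would wrap an LLM call; here they are hard-coded stubs so that the sketch runs on its own.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical agent signatures (assumptions, not taken from the MACM repository).
Thinker = Callable[[List[str], str], Optional[str]]  # proposes a new condition, or None
Judge = Callable[[List[str], str], bool]             # accepts or rejects a proposed condition
Solvable = Callable[[List[str], str], bool]          # can the objective be reached yet?
Executor = Callable[[List[str], str], str]           # plans steps and computes the answer


def macm_loop(
    conditions: List[str],
    objective: str,
    thinker: Thinker,
    judge: Judge,
    solvable: Solvable,
    executor: Executor,
    max_rounds: int = 5,  # assumed cap on mining rounds
) -> Tuple[Optional[str], List[str]]:
    """Return (answer or None, final condition list)."""
    known = list(conditions)  # initial condition list
    for _ in range(max_rounds):
        if solvable(known, objective):  # is the objective reachable?
            return executor(known, objective), known  # plan and execute
        candidate = thinker(known, objective)  # explore a new condition
        if candidate is None:
            break  # nothing left to propose
        if candidate not in known and judge(known, candidate):
            known.append(candidate)  # verified: add to the condition list
        # rejected candidates are simply discarded
    if solvable(known, objective):  # final check after the last mining round
        return executor(known, objective), known
    return None, known


def make_toy_agents() -> Tuple[Thinker, Judge, Solvable, Executor]:
    """Hard-coded stubs for the toy problem 'x + 3 = 5, find 2 * x'."""
    proposals = iter(["x = 8", "x = 2"])  # first proposal is deliberately wrong

    def thinker(known: List[str], objective: str) -> Optional[str]:
        return next(proposals, None)

    def judge(known: List[str], candidate: str) -> bool:
        return candidate == "x = 2"

    def solvable(known: List[str], objective: str) -> bool:
        return "x = 2" in known

    def executor(known: List[str], objective: str) -> str:
        return "4"

    return thinker, judge, solvable, executor


if __name__ == "__main__":
    answer, final_conditions = macm_loop(["x + 3 = 5"], "find 2 * x", *make_toy_agents())
    print(answer, final_conditions)
```

In the toy run, the wrong proposal "x = 8" is rejected by the judge and dropped, while "x = 2" is accepted and added to the condition list before the executor is called. The repeated agent calls inside the loop also illustrate why the approach costs more inference time than a single-pass prompt.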