CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning

Joshua Ong Jun Leang, Aryo Pradipta Gema, Shay B. Cohen

TL;DR

CoMAT tackles the challenge of mathematical reasoning in large language models by introducing a two-stage, within-LLM symbolic reasoning framework: Symbolic Conversion to create a formal representation and Reasoning Execution to derive solutions. It eliminates external solvers, improving faithfulness and verifiability while delivering strong gains across diverse benchmarks and languages, including Olympiad-level problems and low-resource contexts. Ablation and Shapley-value analyses show that all four symbolic steps contribute, with the initial conversion step being particularly crucial for accuracy. Overall, CoMAT demonstrates robust, transparent reasoning with improved performance and error traceability across complex mathematical tasks.
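As a rough illustration of what a Shapley-value attribution over four steps involves, the sketch below computes exact Shapley values by enumerating every subset of steps. The value function `toy_score` and its numbers are made up for illustration and are not results from the paper. The paper's actual value function is not given here, so this assumes a score that can be evaluated with any subset of steps enabled.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: weighted average marginal contribution of each player."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # k = size of the coalition that p joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical, made-up contributions per step (NOT from the paper).
TOY_GAIN = {1: 4.0, 2: 2.0, 3: 1.0, 4: 1.0}


def toy_score(steps):
    """Toy additive score for a set of enabled steps (assumption for illustration)."""
    return sum(TOY_GAIN[s] for s in steps)


phi = shapley_values([1, 2, 3, 4], toy_score)
```

Because the toy score is additive, each step's Shapley value equals its own gain, and the values sum to the score of the full pipeline minus the score of the empty one. A real analysis would replace `toy_score` with measured accuracy for each subset of steps.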

Abstract

Mathematical reasoning remains a significant challenge for large language models (LLMs), despite progress in prompting techniques such as Chain-of-Thought (CoT). We present Chain of Mathematically Annotated Thought (CoMAT), which enhances reasoning through two stages: Symbolic Conversion (converting natural language queries into symbolic form) and Reasoning Execution (deriving answers from symbolic representations). CoMAT operates entirely with a single LLM and without external solvers. Across four LLMs, CoMAT outperforms traditional CoT on six out of seven benchmarks, achieving gains of 4.48% on MMLU-Redux (MATH) and 4.58% on GaoKao MCQ. In addition to improved performance, CoMAT ensures faithfulness and verifiability, offering a transparent reasoning process for complex mathematical tasks.
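The two-stage flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt templates, the function name `comat_style_answer`, the `llm` callable, and the `Answer:` output convention are all hypothetical. Whether the two stages run as separate calls or within one prompt is also an assumption made here for clarity.

```python
from typing import Callable, Optional

# Hypothetical prompt templates; the paper's actual prompts are not reproduced here.
CONVERSION_PROMPT = (
    "Rewrite the following question as a symbolic representation "
    "(variables, constraints, and the quantity to find). Do not solve it.\n\n"
    "Question: {question}"
)
EXECUTION_PROMPT = (
    "Using only the symbolic representation below, reason step by step "
    "and finish with a line of the form 'Answer: <value>'.\n\n"
    "{symbolic}"
)


def comat_style_answer(question: str, llm: Callable[[str], str]) -> dict:
    """Run both stages with the same LLM callable and no external solver."""
    # Stage 1: Symbolic Conversion (natural language -> symbolic form).
    symbolic = llm(CONVERSION_PROMPT.format(question=question))
    # Stage 2: Reasoning Execution (derive the answer from the symbolic form).
    reasoning = llm(EXECUTION_PROMPT.format(symbolic=symbolic))
    # Take the text after the last 'Answer:' marker, if the model produced one.
    answer: Optional[str] = None
    if "Answer:" in reasoning:
        answer = reasoning.rsplit("Answer:", 1)[-1].strip()
    return {"symbolic": symbolic, "reasoning": reasoning, "answer": answer}
```

Returning the intermediate `symbolic` and `reasoning` strings alongside the answer keeps each stage inspectable, which is the kind of error traceability the abstract highlights.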

Paper Structure

This paper contains 27 sections, 4 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: An overview of our CoMAT framework. CoMAT divides complex reasoning tasks into two stages: Symbolic Conversion, where queries are translated into structured symbolic reasoning chains (panel a), and Reasoning Execution, where step-by-step calculations are performed to derive the final answer (panel b).
  • Figure 2: An overview of CoMAT divided into two main stages: Symbolic Conversion and Reasoning Execution
  • Figure 3: An example question from the MMLU Redux Elementary Mathematics dataset, comparing CoT and CoMAT. CoT follows a generic "step-by-step" approach without further guidance. In contrast, CoMAT enhances interpretability and verifiability by clearly pinpointing the error, which in this case arises from Step 5. Traditional CoT, by comparison, lacks the ability to identify specific errors directly.
  • Figure 4: Average performance across all datasets for each model.
  • Figure 5: Performance change $(\Delta)$ for each configuration with missing steps. Detailed results for all complete variants are provided in the appendix section "Detailed Results for Missing Steps".
  • ...and 4 more figures