LLMs can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought

Zhuoxuan Jiang; Haoyuan Peng; Shanshan Feng; Fan Li; Dongsheng Li

TL;DR

This work tackles the challenge of reliably identifying mathematical reasoning mistakes in LLM outputs, a prerequisite for effective self-correction. It introduces Pedagogical Chain-of-Thought (PedCoT), a prompting framework that integrates Bloom Cognitive Model-based pedagogical principles with a two-stage interaction process to ground reasoning and detect errors. Empirical results on BIG-Bench Mistake and PRM800K demonstrate that PedCoT significantly outperforms strong zero-shot and two-stage baselines across GPT-4 variants, with ablations underscoring the importance of each principle and the two-stage design. The study highlights the practical potential of domain-knowledge guided prompting for robust math reasoning and automatic answer grading, suggesting broader applicability to structured reasoning tasks beyond mathematics.

Abstract

Self-correction is emerging as a promising approach to mitigate hallucination in Large Language Models (LLMs). To facilitate effective self-correction, recent research has proposed mistake detection as its initial step. However, the current literature suggests that LLMs often struggle to reliably identify reasoning mistakes when using simplistic prompting strategies. To address this challenge, we introduce a prompting strategy, termed the Pedagogical Chain-of-Thought (PedCoT), which is specifically designed to guide the identification of reasoning mistakes, particularly mathematical reasoning mistakes. PedCoT consists of pedagogical principles for prompt design (PPP), a two-stage interaction process (TIP), and grounded PedCoT prompts, all inspired by the educational theory of the Bloom Cognitive Model (BCM). We evaluate our approach on two public datasets featuring math problems of varying difficulty levels. The experiments demonstrate that our zero-shot prompting strategy significantly outperforms strong baselines. The proposed method achieves reliable mathematical mistake identification and provides a foundation for automatic math answer grading. The results underscore the significance of educational theory, serving as domain knowledge, in guiding the design of prompting strategies that address challenging tasks with LLMs effectively.

Paper Structure (33 sections, 2 equations, 2 figures, 5 tables)

Introduction
Related Work
Automatic Answer Grading
Chain-of-Thought Prompting
Detecting Reasoning Errors with LLMs
Problem Statement and Analysis
Problem Definition
Analysis
Ability Level.
Pedagogical Principles for Prompt Design.
Methodology
Two-Stage Interaction Process
Stage-1: Regenerate.
Stage-2: Extract-Compare.
Pedagogical Chain-of-Thought Prompts
...and 18 more sections

Figures (2)

Figure 1: We develop the principles for prompt design for LLMs by leveraging the educational Bloom Cognitive Model and we focus on the learning ability. The bold parts are keywords used in prompts.
Figure 2: Diagram of the two-stage interaction process (TIP) with LLMs for finding mistakes at the $i$-th step. The left and right parts show example input and output contents. The detailed contents of the example, along with the complete prompts, are given in the Appendix.
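
The step-wise, two-stage loop described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `find_first_mistake`, the `llm` callable, the prompt wording, and the `MATCH`/`MISMATCH` verdict format are all hypothetical, and the reading of the stage names (Stage-1 regenerates the expected $i$-th step, Stage-2 compares it with the given step) is one plausible interpretation of the outline above. The paper's actual PedCoT prompts are in its Appendix and are not reproduced here.

```python
from typing import Callable, List, Optional


def find_first_mistake(
    question: str,
    steps: List[str],
    llm: Callable[[str], str],
) -> Optional[int]:
    """Return the 0-based index of the first step judged wrong, or None.

    `llm` is any function mapping a prompt string to a response string
    (hypothetical interface; plug in a real model client as needed).
    """
    for i, step in enumerate(steps):
        previous = "\n".join(steps[:i])

        # Stage 1 (Regenerate, assumed meaning): ask the model to produce
        # what step i should be, without showing it the candidate step.
        stage1_prompt = (
            "REGENERATE\n"
            f"Question: {question}\n"
            f"Previous steps:\n{previous}\n"
            "Write the next reasoning step."
        )
        regenerated = llm(stage1_prompt)

        # Stage 2 (Extract-Compare, assumed meaning): ask the model to
        # compare the candidate step against the regenerated one.
        stage2_prompt = (
            "COMPARE\n"
            f"Expected step: {regenerated}\n"
            f"Student step: {step}\n"
            "Answer MATCH or MISMATCH."
        )
        verdict = llm(stage2_prompt)

        if verdict.strip().upper().startswith("MISMATCH"):
            return i
    return None


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a model, used only to exercise the loop."""
    if prompt.startswith("COMPARE"):
        return "MISMATCH" if "2 + 2 = 5" in prompt else "MATCH"
    return "expected step"


if __name__ == "__main__":
    solution = ["1 + 1 = 2", "2 + 2 = 5", "5 * 3 = 15"]
    print(find_first_mistake("Compute the result.", solution, fake_llm))
```

Keeping the model behind a plain callable makes the loop easy to test with a stub such as `fake_llm` before wiring in a real client; the stopping rule (return at the first flagged step) is likewise an assumption of this sketch.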
