From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench

Weikang Shi, Houxing Ren, Junting Pan, Aojun Zhou, Ke Wang, Zimu Lu, Yunqiao Yang, Yuxuan Hu, Linda Wei, Mingjie Zhan, Hongsheng Li

TL;DR

This paper introduces KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark that assesses LLMs from two complementary perspectives, and shows that models fine-tuned on the accompanying KMP-Pile dataset improve substantially, underscoring the value of pedagogically-rich training data for developing more effective AI math tutors.

Abstract

Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically-rich training data for developing more effective AI math tutors.

Paper Structure

This paper contains 31 sections, 1 equation, 26 figures, 6 tables.

Figures (26)

  • Figure 1: The curation pipeline for the KMP-Bench dataset. The process generates four distinct pedagogical components from curated K-8 problems using an LLM guided by human-crafted few-shot examples. After a rigorous quality control process (including LLM-based self-verification and model ensemble validation), these components are organized into Dialogue Flows, which are then manually verified for pedagogical soundness before being woven into final tutoring dialogues.
  • Figure 2: Statistical distributions of the dataset.
  • Figure 3: The evaluation framework of KMP-Dialogue. The process evaluates a model's Tutor Response in the context of a full dialogue history and specific instructions (left panel). An LLM or human evaluator compares the Tutor Response against a Reference Response using the 4 general criteria and the criteria corresponding to the relevant pedagogical principle(s) to determine a final "Win", "Tie", or "Lose" outcome.
  • Figure 4: The evaluation framework for the Mathematical Problem Generation task in KMP-Skills.
  • Figure 5: The distribution of the task errors.
  • ...and 21 more figures
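
The Figure 3 caption describes each Tutor Response being judged "Win", "Tie", or "Lose" against a Reference Response. As a rough illustration of how such per-response judgments might be aggregated, here is a minimal Python sketch. The function name `summarize_outcomes`, the example judgments, and the choice to report simple proportions are all assumptions made for illustration; the source does not state how the paper aggregates these outcomes.

```python
from collections import Counter

# Outcome labels taken from the Figure 3 caption.
OUTCOMES = ("Win", "Tie", "Lose")


def summarize_outcomes(judgments):
    """Tally pairwise judgments and return the share of each outcome.

    Hypothetical helper: reporting plain proportions is an assumption,
    not necessarily the aggregation used in the paper.
    """
    counts = Counter(judgments)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unexpected outcome labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments to summarize")
    return {label: counts[label] / total for label in OUTCOMES}


# Made-up judgments for four tutor responses (illustrative data only).
example = ["Win", "Tie", "Lose", "Win"]
rates = summarize_outcomes(example)
print(rates)
```

Keeping "Tie" as its own share, rather than folding it into wins or losses, avoids committing to a tie-handling convention that the source does not specify.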