Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors

Nico Daheim, Jakub Macina, Manu Kapur, Iryna Gurevych, Mrinmaya Sachan

TL;DR

This work addresses a core challenge in LLM-based dialog tutoring: accurately detecting student reasoning errors and delivering targeted feedback without hallucinations. It proposes a verify-then-generate architecture where a verifier (classification, error description, or step alignment) assesses a student's solution against a reference, and a separate generator crafts the tutor response conditioned on the verifier’s output. A new dataset of approximately 1,002 teacher-annotated stepwise math solutions provides benchmarks for error detection and grounding feedback, and multiple verifiers are evaluated both automatically and via human teachers. Across automatic metrics and expert evaluation, grounding response generation in verification substantially improves the relevance and correctness of tutor feedback, with the Error Description verifier delivering the strongest improvements and reducing hallucinations. The approach offers a practical pathway to safer, more effective AI tutors and is complemented by open data and models for further research.

Abstract

Large language models (LLMs) present an opportunity to scale high-quality personalized education to all. A promising approach towards this end is to build dialog tutoring models that scaffold students' problem-solving. However, even though existing LLMs perform well in solving reasoning questions, they struggle to precisely detect students' errors and tailor their feedback to these errors. Inspired by real-world teaching practice where teachers identify student errors and customize their response based on them, we focus on verifying student solutions and show how grounding in such verification improves the overall quality of tutor response generation. We collect a dataset of 1K stepwise math reasoning chains with the first error step annotated by teachers. We show empirically that finding the mistake in a student solution is challenging for current models. We propose and evaluate several verifiers for detecting these errors. Using both automatic and human evaluation, we show that the student solution verifiers steer the generation model towards highly targeted responses to student errors, which are more often correct and contain fewer hallucinations than those of existing baselines.
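
The two-stage idea described above (first verify the student solution against a reference, then generate a tutor response conditioned on the verifier output) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Verification`, `toy_verifier`, and `tutor_response` are hypothetical, the verifier is a toy exact-match comparison standing in for a learned model, and the generator is any callable that maps a prompt string to a response string.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verification:
    first_error_step: Optional[int]  # 0-based index; None if no mismatch was found
    description: str                 # text that is handed to the generator


def toy_verifier(student_steps: List[str], reference_steps: List[str]) -> Verification:
    """Hypothetical stand-in for a verifier: exact string match, step by step.

    Only the overlapping prefix of the two solutions is compared.
    """
    for i, (student, reference) in enumerate(zip(student_steps, reference_steps)):
        if student.strip() != reference.strip():
            return Verification(
                i, f"Step {i + 1}: student wrote '{student}', reference has '{reference}'."
            )
    return Verification(None, "No mismatch with the reference solution was found.")


def tutor_response(
    history: List[str],
    student_steps: List[str],
    reference_steps: List[str],
    verifier: Callable[[List[str], List[str]], Verification],
    generator: Callable[[str], str],
) -> str:
    """Stage 1: verify the solution. Stage 2: generate, grounded in the verification."""
    verification = verifier(student_steps, reference_steps)
    history_text = "\n".join(history)
    prompt = (
        f"Conversation so far:\n{history_text}\n"
        f"Verifier output: {verification.description}\n"
        "Tutor:"
    )
    return generator(prompt)


if __name__ == "__main__":
    echo = lambda prompt: prompt  # placeholder generator; a real system would call an LLM
    print(
        tutor_response(
            ["Student: I got 25."],
            ["2 + 3 = 5", "5 * 4 = 25"],
            ["2 + 3 = 5", "5 * 4 = 20"],
            toy_verifier,
            echo,
        )
    )
```

Keeping the verifier and the generator as separate callables mirrors the separation described in the text: the verifier can be swapped (classification, error description, step alignment) without touching the generation step.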

Paper Structure (51 sections, 2 equations, 13 figures, 11 tables, 1 algorithm)

Figures (13)

  • Figure 1: Directly generating a tutor response based on the conversation history can lead to hallucinations (bottom left). To alleviate this, we split this process into two sequential tasks (right): 1) A model identifies the student's mistake. 2) A different response generation model communicates the identified mistake. We use different verifiers: providing the Error Reason (Wang et al., 2024), Classification-based Verification, providing a more detailed Error Description, and a Step Alignment of student and reference solution. Especially the latter two reduce hallucinations and make tutor models more targeted at the student error when verification and generation are combined (see the results section).
  • Figure 2: User interface for annotating the step of the first error, its categorization, and a description of the error.
  • Figure 3: Dataset Distribution. The index of the step with the first error annotated by teachers and the total student solution steps.
  • Figure 4: User interface for explaining the student's error and evaluating two model-generated error descriptions. Afterwards, annotators evaluate the quality of the model responses.
  • Figure 5: User interface for evaluation of the quality of the model responses. Some responses contain attention checks (second question in this case).
  • ...and 8 more figures