MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems

Jakub Macina, Nico Daheim, Sankalan Pal Chowdhury, Tanmay Sinha, Manu Kapur, Iryna Gurevych, Mrinmaya Sachan

TL;DR

MathDial tackles the scarcity of high-quality dialogue tutoring data with a semi-synthetic pipeline that pairs expert teachers with an LLM-simulated student, producing ~3k tutoring dialogues grounded in multi-step math problems. Rich annotations (grounding, student confusions, teacher moves) support finetuning models as tutors rather than mere problem solvers, and the dataset supports evaluation in interactive tutoring scenarios. Key findings show that small, finetuned open-source models can outperform larger prompted LLMs on tutoring tasks, especially when grounded in step-by-step solutions, though generalization to unseen problems remains challenging. The dataset is publicly available, and the results underline the importance of pedagogical grounding for scalable math tutoring systems.

Abstract

While automatic dialogue tutors hold great potential in making education personalized and more accessible, research on such systems has been hampered by a lack of sufficiently large and high-quality datasets. Collecting such datasets remains challenging, as recording tutoring sessions raises privacy concerns and crowdsourcing leads to insufficient data quality. To address this, we propose a framework to generate such dialogues by pairing human teachers with a Large Language Model (LLM) prompted to represent common student errors. We describe how we use this framework to collect MathDial, a dataset of 3k one-to-one teacher-student tutoring dialogues grounded in multi-step math reasoning problems. While models like GPT-3 are good problem solvers, they fail at tutoring because they generate factually incorrect feedback or are prone to revealing solutions to students too early. To overcome this, we let teachers provide learning opportunities to students by guiding them using various scaffolding questions according to a taxonomy of teacher moves. We demonstrate MathDial and its extensive annotations can be used to finetune models to be more effective tutors (and not just solvers). We confirm this by automatic and human evaluation, notably in an interactive setting that measures the trade-off between student solving success and telling solutions. The dataset is released publicly.

Paper Structure (45 sections, 1 equation, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Current models achieve high accuracy in solving math word problems (MWPs) but struggle with teaching, since they often give incorrect feedback or reveal the solution too early. MathDial mitigates this using scaffolding questions and grounding annotations.
  • Figure 2: Overview of the data collection pipeline: First, student confusions are oversampled from an LLM and sorted by frequency. Then, a human teacher synchronously interacts with a student simulated by an LLM that is instructed with a student profile and incorrect solution.
  • Figure 3: Teacher judgments on the ability of InstructGPT to simulate students. Teachers rate the simulated behaviour as largely plausible. Lighter regions on top account for questions where the confusion was not resolved.
  • Figure 4: Overall distribution of teacher moves (left) and their distribution at each dialogue step (right). Teachers tend to start with Focus and Probing and then increasingly use Telling as the conversation progresses.
  • Figure 5: Performance of our tutor model and three baselines on interactive tutoring of the student model. We find the model trained on MathDial to have a similar success@5 rate with less telling.
  • ...and 5 more figures
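
The first step in the Figure 2 caption (student confusions are oversampled from an LLM and sorted by frequency) can be illustrated with a minimal sketch. This is not the authors' code: the function name `rank_confusions` and the example strings are hypothetical, and the sketch assumes the sampled incorrect solutions are already available as plain strings.

```python
from collections import Counter


def rank_confusions(sampled_solutions):
    """Count repeated incorrect solutions and sort them by frequency.

    `sampled_solutions` is a hypothetical list of strings, one per LLM sample.
    """
    # Strip surrounding whitespace so identical samples are grouped together.
    counts = Counter(s.strip() for s in sampled_solutions)
    # most_common() returns (solution, count) pairs, most frequent first.
    return counts.most_common()


# Hypothetical samples, for illustration only.
samples = [
    "adds the discount instead of subtracting it",
    "forgets to convert minutes to hours",
    "adds the discount instead of subtracting it ",
]
ranked = rank_confusions(samples)
```

Under this assumption, the most frequent entry in `ranked` would be the confusion given to the simulated student; how the paper actually groups or filters samples is not stated in this section.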