Towards the Pedagogical Steering of Large Language Models for Tutoring: A Case Study with Modeling Productive Failure

Romain Puech; Jakub Macina; Julia Chatain; Mrinmaya Sachan; Manu Kapur

TL;DR

This work defines Pedagogical Steering for large language model (LLM) tutors and presents StratL, a three-part algorithm that steers an LLM through a multi-turn tutoring plan inspired by Productive Failure (PF), using a transition graph of intents. By combining a Student State Tracing classifier with an expert-defined Intent Selection graph, StratL aims to keep the tutor faithful to PF: it prompts students to generate multiple representations rather than directly revealing solutions. The approach is validated via a simulated study and a field test with 17 ninth-grade students, showing that StratL increases PF fidelity (more student-generated representations) without degrading core tutor qualities like coherence or empathy. The paper also releases a PF problem dataset and source code, discusses practical limitations, and outlines opportunities for classroom integration and future enhancements.

Abstract

One-to-one tutoring is one of the most efficient methods of teaching. With the growing popularity of Large Language Models (LLMs), there have been efforts to create LLM-based conversational tutors that can extend the benefits of one-to-one tutoring to everyone. However, current LLMs are trained primarily to be helpful assistants and lack crucial pedagogical skills. For example, they often quickly reveal the solution to the student and fail to plan for a richer multi-turn pedagogical interaction. To use LLMs in pedagogical settings, they need to be steered to use effective teaching strategies: a problem we introduce as Pedagogical Steering. We develop StratL, an algorithm that optimizes LLM prompts and steers the model to follow a predefined multi-turn tutoring plan represented as a transition graph. As a case study, we create a prototype tutor for high school math following Productive Failure (PF), an advanced and effective learning design. To validate our approach in a real-world setting, we run a field study with 17 high school students in Singapore and show that StratL succeeds in steering the LLM to follow the PF tutoring strategy. Finally, we highlight challenges in Pedagogical Steering of LLMs and offer opportunities for further improvements by publishing a dataset of PF problems and our code.

Paper Structure (42 sections, 3 equations, 10 figures, 6 tables)

Introduction
Background and Related Work
Pedagogical shortcomings of LLMs
LLMs for Tutoring
Productive Failure (PF)
Methodology
Formalism
Dialog Tutoring Task
Tutoring Strategy Modeling
Student State Tracing
Algorithm
Experiments
Research Questions
Dataset
Baselines
...and 27 more sections

Figures (10)

Figure 1: Schematic representation of StratL (in blue), an algorithm to control an LLM's tutoring strategy.
Figure 2: Transition graph of the PF Intent Selection. Nodes correspond to intents (listed in Table \ref{tab:taxonomy} of Appendix \ref{sec:appendixtaxonomy}). The variables used in the arrows' transition conditions ('a', …, 'm') are codes for the state features (displayed in Figure \ref{fig:Assessor_prompt} of Appendix \ref{sec:appendixassessor}). At each turn, we transition from one set of intents to the next by following all satisfied arrows. Arrows with no starting node can be taken from any node. At turn $0$, we start at node 'Ask for the Next Step'. We describe the Learning Sciences justifications for this PF modeling in Appendix \ref{sec:appendixlearningsciencesbases}.
Figure 3: Field test results and their confidence intervals. 'PF.S' refers to the 'PF Score' as defined in Section \ref{sec:evaluationmetrics} and is out of 4. 'RSMs' denotes the number of student-generated RSMs. StratL (V1) succeeds in steering the LLM to follow PF as compared to V2.
Figure 4: Problem Country and its solution.
Figure 5: Problem Consistency and its solution.
...and 5 more figures
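
The transition rule stated in the Figure 2 caption (follow every satisfied arrow, allow arrows with no starting node from any node, start at 'Ask for the Next Step') can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Arrow` class, the `step` function, the two "Hypothetical Intent" nodes, and the meaning attached to the feature codes 'a' and 'b' are all assumptions made for illustration. The choice to keep the current intents when no arrow is satisfied is also an assumption, since the caption does not cover that case.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Mapping, Optional, Sequence

# State features keyed by their codes ('a', ..., 'm' in the caption).
Features = Mapping[str, bool]


@dataclass(frozen=True)
class Arrow:
    """One arrow of the transition graph (hypothetical representation)."""
    source: Optional[str]  # None models an arrow with no starting node
    target: str
    condition: Callable[[Features], bool]


# Turn 0 starts at the node named in the caption.
START: FrozenSet[str] = frozenset({"Ask for the Next Step"})


def step(current: FrozenSet[str], arrows: Sequence[Arrow],
         features: Features) -> FrozenSet[str]:
    """Follow all satisfied arrows leaving the current set of intents."""
    targets = {
        arrow.target
        for arrow in arrows
        if (arrow.source is None or arrow.source in current)
        and arrow.condition(features)
    }
    # Assumption: stay on the current intents if no arrow is satisfied.
    return frozenset(targets) if targets else current


# Placeholder graph: only 'Ask for the Next Step' comes from the paper.
ARROWS = [
    Arrow("Ask for the Next Step", "Hypothetical Intent A",
          lambda f: f.get("a", False)),
    Arrow("Hypothetical Intent A", "Ask for the Next Step",
          lambda f: not f.get("a", False)),
    Arrow(None, "Hypothetical Intent B", lambda f: f.get("b", False)),
]

# Both the sourced arrow and the source-less arrow fire on this turn.
intents = step(START, ARROWS, {"a": True, "b": True})
```

Representing the current position as a set rather than a single node mirrors the caption's wording ("from one set of intents to the next"), since several arrows may be satisfied in the same turn.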
