AutoTutor meets Large Language Models: A Language Model Tutor with Rich Pedagogy and Guardrails

Sankalan Pal Chowdhury; Vilém Zouhar; Mrinmaya Sachan

TL;DR

This work presents MWPTutor, a hybrid intelligent tutoring system that uses Large Language Models to populate a handcrafted, pedagogy-driven finite-state transducer for math word problems. By separating pedagogy design from domain expertise and introducing guardrails, MWPTutor maintains interpretable, controllable tutoring while gaining the flexibility of LLMs. Empirical results on MathDial and MultiArith show that MWPTutor variants, especially the live GPT-4 version, achieve high tutoring scores with minimal revealing of answers, outperforming a single-prompt GPT-4 baseline in automatic metrics and receiving favorable human judgments on utterances. The study demonstrates a modular, scalable path toward LLM-assisted ITS, while outlining concrete directions to enrich the solution/strategy spaces and improve reliability, context awareness, and dialogue quality for real-world deployment.

Abstract

Large Language Models (LLMs) have found several use cases in education, ranging from automatic question generation to essay evaluation. In this paper, we explore the potential of using LLMs to author Intelligent Tutoring Systems. A common pitfall of LLMs is that they stray from desired pedagogical strategies, for example by leaking the answer to the student, and more generally provide no guarantees. We posit that while LLMs with certain guardrails can take the place of subject experts, the overall pedagogical design still needs to be handcrafted for the best learning results. Based on this principle, we create a sample end-to-end tutoring system named MWPTutor, which uses LLMs to fill in the state space of a pre-defined finite state transducer. This approach retains the structure and pedagogy of traditional tutoring systems, which learning scientists have developed over the years, while bringing in the additional flexibility of LLM-based approaches. Through a human evaluation study on two datasets based on math word problems, we show that our hybrid approach achieves a better overall tutoring score than an instructed, but otherwise free-form, GPT-4. MWPTutor is completely modular and opens up the scope for the community to improve its performance by improving individual modules or using different teaching strategies that it can follow.
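
As a rough illustration of the idea described in the abstract, the sketch below shows a handcrafted finite-state transducer whose states are (solution step, strategy) pairs, with a pluggable text generator standing in for the LLM and a simple guardrail that rejects utterances containing the step's answer. Everything here is hypothetical and not taken from the paper: the names `TutorFST`, `Step`, `STRATEGIES`, `mentions`, and `FALLBACK`, the two-level strategy ladder, and the string-matching guardrail are assumptions made for illustration only. The actual MWPTutor state space and guardrails are richer.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical strategy ladder: a failure escalates to the next strategy.
STRATEGIES = ["prompt", "hint"]
# Hypothetical canned utterance used when the generator leaks the answer.
FALLBACK = "Can you work out the next step?"
DONE = "Correct, that completes the problem."


@dataclass
class Step:
    description: str  # what this solution step asks for
    answer: str       # expected (numeric) answer for this step


def mentions(text: str, answer: str) -> bool:
    """True if `answer` appears in `text` as a standalone number."""
    return re.search(rf"(?<!\d){re.escape(answer)}(?!\d)", text) is not None


class TutorFST:
    """Handcrafted transducer; `generate` stands in for an LLM call."""

    def __init__(self, steps: List[Step], generate: Callable[[Step, str], str]):
        self.steps = steps
        self.generate = generate
        self.step_idx = 0      # position in the solution step space
        self.strategy_idx = 0  # position in the strategy space

    @property
    def done(self) -> bool:
        return self.step_idx >= len(self.steps)

    def _speak(self) -> str:
        step = self.steps[self.step_idx]
        utterance = self.generate(step, STRATEGIES[self.strategy_idx])
        # Guardrail: never emit an utterance that gives away the answer.
        if mentions(utterance, step.answer):
            return FALLBACK
        return utterance

    def start(self) -> str:
        """Emit the opening tutor utterance for the first state."""
        return self._speak()

    def respond(self, student_reply: str) -> str:
        """Consume one student utterance, transition, emit one tutor utterance."""
        if self.done:
            return DONE
        step = self.steps[self.step_idx]
        if mentions(student_reply, step.answer):
            # Success: move to the next solution step, reset the strategy.
            self.step_idx += 1
            self.strategy_idx = 0
            if self.done:
                return DONE
        else:
            # Failure: stay on this step, escalate the strategy (capped).
            self.strategy_idx = min(self.strategy_idx + 1, len(STRATEGIES) - 1)
        return self._speak()


def template_generator(step: Step, strategy: str) -> str:
    """Stand-in for an LLM: a fixed template, assumed for this sketch."""
    return f"{strategy}: {step.description}"
```

Because the transitions live in `respond` rather than in the generator, the pedagogy stays inspectable even if `template_generator` is swapped for a real model call; only the wording of each state's utterance would change.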

Paper Structure (33 sections, 5 figures, 6 tables, 2 algorithms)

Introduction
Background
Rule-Based Systems
Large Language Models
Our work
Design Principles
Solution Decomposition
Pedagogical Strategy
Scaffolding the LLM
The Trade-off between Reliability and Contextual Awareness
MWPTutor
Solution Decomposition
Solution Alignment
Pedagogical Strategy
Answer and Path Detection
...and 18 more sections

Figures (5)

Figure 1: Full state space for our MWP toy example. Vertical blue rectangles show the strategy space and horizontal green rectangles show the solution step space. Solid arrows indicate success in the current step, and dashed arrows indicate failure. The states for branch A are collapsed for clarity.
Figure 2: MWPTutor's state space as a flowchart. The symbol indicates student utterance inputs, while the symbol indicates model output. The solution step space is collapsed for clarity.
Figure 3: Summary of human evaluation for Hints from MWPTutor. The total number of Hint samples is 102. Each sample was evaluated by 3 annotators in 4 categories.
Figure 4: Summary of human evaluation for Prompts from MWPTutor. The total number of Prompt samples is 29 for MWPTutor$^\text{cached}_\text{GPT4}$ and 70 for MWPTutor$^\text{live}_\text{GPT4}$. Each sample was evaluated by 3 annotators in 4 categories.
Figure 5: Screenshot of the intrinsic annotation user interfaces. Each crowd worker evaluated 10-20 dialogues. Apart from the short instructions, we showed longer instructions at the beginning.
