Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes

Rose E. Wang; Qingyang Zhang; Carly Robinson; Susanna Loeb; Dorottya Demszky

TL;DR

Bridge presents a framework for closing the novice-expert gap in math remediation by encoding expert reasoning into a decision-making model via cognitive task analysis. It creates a 700-example dataset of real tutoring dialogues with expert annotations and demonstrates that LLMs, especially GPT-4, significantly benefit when guided by expert decisions; context-sensitive decisions are crucial for high-quality remediation. The paper provides open-source data and methodological tools to embed expert thought processes in AI tutors, offering a scalable path toward equitable, high-quality tutoring. The findings suggest that expert-informed decision pathways can improve student support without sacrificing scalability, with implications for tutoring platforms and education research.

Abstract

Scaling high-quality tutoring remains a major challenge in education. Due to growing demand, many platforms employ novice tutors who, unlike experienced educators, struggle to address student mistakes and thus fail to seize prime learning opportunities. Our work explores the potential of large language models (LLMs) to close the novice-expert knowledge gap in remediating math mistakes. We contribute Bridge, a method that uses cognitive task analysis to translate an expert's latent thought process into a decision-making model for remediation. This involves an expert identifying (A) the student's error, (B) a remediation strategy, and (C) their intention before generating a response. We construct a dataset of 700 real tutoring conversations, annotated by experts with their decisions. We evaluate state-of-the-art LLMs on our dataset and find that the expert's decision-making model is critical for LLMs to close the gap: responses from GPT-4 with expert decisions (e.g., "simplify the problem") are 76% more preferred than responses without them. Additionally, context-sensitive decisions are critical to closing pedagogical gaps: random decisions decrease GPT-4's response quality by 97% relative to expert decisions. Our work shows the potential of embedding expert thought processes in LLM generations to enhance their capability to bridge novice-expert knowledge gaps. Our dataset and code can be found at: https://github.com/rosewang2008/bridge.
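To make the three-step decision model concrete, here is a minimal Python sketch of how decisions (A) to (C) might be attached to a prompt before an LLM generates a response. It is an illustration under assumptions, not the paper's implementation: the `Decisions` class, the `build_prompt` function, and the prompt wording are hypothetical. Only the `lesson_topic` and `c_h` placeholders and the "(maximum one sentence)" constraint are taken from the figure captions on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decisions:
    """Hypothetical container for the three expert decisions (steps A-C)."""
    error: str      # (A) the student's error
    strategy: str   # (B) the remediation strategy
    intention: str  # (C) the tutor's intention


def build_prompt(lesson_topic: str, c_h: str,
                 decisions: Optional[Decisions] = None) -> str:
    """Assemble an illustrative prompt; the wording is NOT the paper's prompt.

    lesson_topic: topic of the lesson discussed in the conversation.
    c_h: conversation history up to and including the student's mistake.
    decisions: expert decisions, or None for the no-decision condition.
    """
    lines = [
        f"Lesson topic: {lesson_topic}",
        "Conversation so far:",
        c_h,
    ]
    if decisions is not None:
        # Condition the response on the expert's decisions.
        lines += [
            f"Student error: {decisions.error}",
            f"Remediation strategy: {decisions.strategy}",
            f"Tutor intention: {decisions.intention}",
        ]
    lines.append("Tutor response (maximum one sentence):")
    return "\n".join(lines)
```

Calling `build_prompt(topic, history)` without decisions gives the unguided condition, while passing a `Decisions` value (for example with `strategy="simplify the problem"`) gives the expert-guided one, so the two conditions differ only in the decision lines.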

Paper Structure (60 sections, 1 equation, 12 figures, 5 tables)

Introduction
Related Work
Modeling the Decision-Making Process of Experts
Responding to Student Mistakes in Mathematics
Automated Feedback in Education
Math Tutoring Datasets
Data Sources
Tutoring transcripts.
Preprocessing.
The Bridge Method for Expert-Guided Decision-Making
Cognitive Task Analysis
Collaboration with experts.
Development of decision-making process.
Development of decision options.
Decision Options
...and 45 more sections

Figures (12)

Figure 1: ① Closing the knowledge gap at scale. LLMs and novice tutors lack the pedagogical knowledge to engage with student mistakes, yet they are readily available for 1:1 tutoring. Experts like experienced teachers have the pedagogical knowledge, but are hard to scale. ② How do we model the expert's thought process? Our work builds Bridge, which leverages cognitive task analysis to translate the latent thought process of experts into a decision-making model. ③ Applying Bridge with LLMs. To bridge the knowledge gap, we scale the expert's knowledge with LLMs using the expert-guided decision-making model.
Figure 2: Expert decision-making paths are diverse, whereas LLMs' paths are less diverse. The entropy of decision paths is shown in the subcaption: the experts' paths have higher entropy and thus are more diverse than those of the LLMs. The red left column is Step A's error decision; the green middle column is Step B's strategy decision; and the blue right column is Step C's intention decision.
Figure 3: Annotation interface for collecting decisions and responses.
Figure 4: Prompt for the no decision-making condition for gpt-4 and gpt-3.5-turbo. {lesson_topic} is the placeholder for the lesson topic discussed in the conversation. {c_h} is the placeholder for the conversation history leading up to (and including) the student's message that contains the mistake. We add the constraint "(maximum one sentence)" because, in our experiments, gpt-3.5-turbo and gpt-4 typically output extremely long responses that would be unnatural for this tutoring conversation domain.
Figure 5: Prompt for the no decision-making condition for llama-2. {lesson_topic} is the placeholder for the lesson topic discussed in the conversation. {c_h} is the placeholder for the conversation history leading up to (and including) the student's message that contains the mistake.
...and 7 more figures
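
The Figure 2 caption compares the entropy of expert and LLM decision paths. The sketch below shows one way such a number could be computed. It assumes Shannon entropy in bits over the empirical distribution of (error, strategy, intention) triples; the caption does not state the exact definition or log base, and the function name `path_entropy` is hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

# A decision path is one (error, strategy, intention) triple (steps A, B, C).
Path = Tuple[str, str, str]


def path_entropy(paths: Iterable[Path]) -> float:
    """Shannon entropy (base 2) of the empirical distribution over paths.

    Assumption: each distinct triple is treated as one outcome; the paper
    may define path entropy differently.
    """
    counts = Counter(paths)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log2(p)
    return entropy
```

Under this definition, an annotator who always picks the same path scores 0 bits, and one who spreads evenly over four distinct paths scores 2 bits, which matches the caption's reading that higher entropy means more diverse paths.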
