Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study
Conrad Borchers, Tianze Shou
TL;DR
This study benchmarks whether large language models (LLMs) can match the adaptivity and instructional fidelity of intelligent tutoring systems (ITS) by applying a prompt variation framework to 75 real-world algebra tutoring scenarios from Lynnette. Using three LLMs (Llama3-8B, Llama3-70B, GPT-4o), the authors generate 1,350 instructional moves and test adaptivity with embedding-based randomization tests, alongside a tutor-training classifier to assess pedagogical soundness. Results show limited ITS-like adaptivity overall, with Llama3-70B uniquely showing significant responsiveness to student errors, while GPT-4o adheres to prompts but tends to deliver overly direct feedback, and Llama3-8B excels in perceived soundness but suffers from formatting and coherence issues. The work provides open-source benchmarking code and argues that, as of now, LLMs are unlikely to replicate the structured, context-driven guidance of ITS; it also outlines implications for hybrid tutoring and future benchmarking improvements.
Abstract
Large Language Models (LLMs) hold promise as dynamic instructional aids. Yet it remains unclear whether LLMs can replicate the adaptivity of intelligent tutoring systems (ITS), in which student knowledge and pedagogical strategies are explicitly modeled. We propose a prompt variation framework to assess the adaptivity and pedagogical soundness of LLM-generated instructional moves across 75 real-world tutoring scenarios from an ITS. We systematically remove key context components (e.g., student errors and knowledge components) from prompts to create variations of each scenario. Three representative LLMs (Llama3-8B, Llama3-70B, and GPT-4o) generate 1,350 instructional moves. We use text embeddings and randomization tests to measure how the omission of each context feature impacts the LLMs' outputs (adaptivity), and a validated tutor-training classifier to evaluate response quality (pedagogical soundness). Surprisingly, even the best-performing model only marginally mimics the adaptivity of ITS. Specifically, only Llama3-70B demonstrates statistically significant adaptivity to student errors. Although Llama3-8B's recommendations receive higher pedagogical soundness scores than those of the other models, it struggles with instruction-following behaviors, including output formatting. By contrast, GPT-4o reliably adheres to instructions but tends to provide overly direct feedback, diverging from effective tutoring practice, which prompts learners with open-ended questions to gauge their knowledge. Given these results, we discuss how current LLM-based tutoring is unlikely to produce learning benefits rivaling those of ITS tutoring that is known to be effective. Through our open-source benchmarking code, we contribute a reproducible method for evaluating LLMs' instructional adaptivity and fidelity.
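To make the adaptivity measure concrete, the following is a minimal sketch of a randomization (permutation) test on text embeddings. It is an illustration under stated assumptions, not the benchmarking code released with the paper: the test statistic (cosine distance between the mean embeddings of the two prompt conditions), the function names, and the toy vectors standing in for real embeddings are all hypothetical choices.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def randomization_test(full, ablated, n_permutations=2000, seed=0):
    """Permutation test on the distance between two groups of embeddings.

    `full` holds embeddings of responses to the complete prompt; `ablated`
    holds embeddings of responses to a prompt with one context feature
    removed. The statistic is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    observed = cosine_distance(centroid(full), centroid(ablated))
    pooled = list(full) + list(ablated)
    n_full = len(full)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffle condition labels by shuffling the pooled embeddings.
        rng.shuffle(pooled)
        stat = cosine_distance(centroid(pooled[:n_full]), centroid(pooled[n_full:]))
        if stat >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


def toy_embeddings(center, n, noise, rng):
    """Hypothetical stand-in for real text embeddings: noisy copies of `center`."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n)]


data_rng = random.Random(42)
# Toy case: removing the context feature shifts the responses.
full_prompt = toy_embeddings([1.0, 0.0, 0.0], 10, 0.05, data_rng)
ablated_prompt = toy_embeddings([0.0, 1.0, 0.0], 10, 0.05, data_rng)
observed, p_value = randomization_test(full_prompt, ablated_prompt)
print(f"distance={observed:.3f}, p={p_value:.4f}")
```

Under this sketch, a small p-value suggests that omitting the context feature changed the model's outputs more than a random relabeling of responses would, which is the sense of adaptivity described above. With real data, the toy vectors would be replaced by embeddings of the generated instructional moves.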
