TeachTune: Reviewing Pedagogical Agents Against Diverse Student Profiles with Simulated Students
Hyoungwook Jin, Minju Yoo, Jeongeon Park, Yokyung Lee, Xu Wang, Juho Kim
TL;DR
TeachTune addresses the challenge of validating LLM-based pedagogical conversational agents (PCAs) across diverse learners by enabling multi-turn evaluation with simulated students. The system combines a graph-based PCA authoring interface, templated, reader-friendly student profiles, and the Personalized Reflect-Respond pipeline to generate believable, trait-aware conversations between simulated students and PCAs. An empirical evaluation with teachers shows that automated chats expand test coverage and reduce task load, and an ablation indicates that trait-overview explanations improve believability; learning outcomes were not measured. This work offers a scalable, reproducible framework for pre-deployment PCA testing that supports more inclusive classroom practices and safer deployment of educational AI tutors.
Abstract
Large language models (LLMs) can empower teachers to build pedagogical conversational agents (PCAs) customized for their students. As students have different prior knowledge and motivation levels, teachers must review the adaptivity of their PCAs to diverse students. Existing chatbot reviewing methods (e.g., direct chat and benchmarks) are either manually intensive over multiple iterations or limited to testing single-turn interactions. We present TeachTune, where teachers can create simulated students and review PCAs by observing automated chats between PCAs and simulated students. Our technical pipeline instructs an LLM-based student to simulate prescribed knowledge levels and traits, helping teachers explore diverse conversation patterns. Our pipeline could produce simulated students whose behaviors correlate highly with their input knowledge and motivation levels, within 5% and 10% accuracy gaps, respectively. Thirty science teachers designed PCAs in a between-subjects study, and using TeachTune resulted in a lower task load and higher student profile coverage than a baseline.
