TeachTune: Reviewing Pedagogical Agents Against Diverse Student Profiles with Simulated Students

Hyoungwook Jin, Minju Yoo, Jeongeon Park, Yokyung Lee, Xu Wang, Juho Kim

TL;DR

TeachTune addresses the challenge of validating LLM-based pedagogical conversational agents (PCAs) across diverse learners by enabling multi-turn evaluation with simulated students. The system combines a graph-based PCA authoring interface, templated, reader-friendly student profiles, and the Personalized Reflect-Respond pipeline to generate trait-aware, believable simulated-student interactions. An empirical evaluation with teachers shows that automated chats expand test coverage and reduce task load, and an ablation indicates that trait-overview explanations improve believability; learning outcomes were not measured. This work offers a scalable, reproducible framework for pre-deployment PCA testing that supports more inclusive classroom practices and safer deployment of educational AI tutors.

Abstract

Large language models (LLMs) can empower teachers to build pedagogical conversational agents (PCAs) customized for their students. As students have different prior knowledge and motivation levels, teachers must review the adaptivity of their PCAs to diverse students. Existing chatbot reviewing methods (e.g., direct chat and benchmarks) are either manually intensive for multiple iterations or limited to testing only single-turn interactions. We present TeachTune, where teachers can create simulated students and review PCAs by observing automated chats between PCAs and simulated students. Our technical pipeline instructs an LLM-based student to simulate prescribed knowledge levels and traits, helping teachers explore diverse conversation patterns. Our pipeline could produce simulated students whose behaviors correlate highly with their input knowledge and motivation levels within 5% and 10% accuracy gaps. Thirty science teachers designed PCAs in a between-subjects study, and using TeachTune resulted in a lower task load and higher student profile coverage than a baseline.

Paper Structure (64 sections, 15 figures, 5 tables)

Figures (15)

  • Figure 1: The interface used for the formative interview. On the left is the Direct Chat tab, where interviewees could converse with the chatbot in the student's role. Interviewees could roll back to previous messages by clicking the rewind button next to the chatbot's message. On the right is the Test Cases tab, where interviewees could add a set of student utterances and see the chatbot's responses.
  • Figure 2: A PCA follows the dialogue flow defined in its state diagram. Nodes represent the PCA's utterance, and edges represent the potential response path of simulated students. The root node (A) contains the PCA's starting message and initial behavior. Based on a student's response, the master agent keeps the current state or changes the active node to one of the connected nodes (B). The next active node determines the PCA's subsequent response (C).
  • Figure 3: The TeachTune interface. On the right, a teacher can add new student profiles (A) and review their auto-generated conversation (B). The teacher can also check the student's current knowledge stage at each utterance (C). On the left is the PCA creation interface with a state diagram. The robot icon shows the current state (i.e., active node) of the PCA at each turn (D). The PCA changes its behavior according to the conditions (E) and follows the instructions written on the currently active node (F).
  • Figure 4: The interface to create a student profile. Teachers set the initial knowledge level of the student by check-marking the knowledge components to turn on at the beginning of a conversation (A). They also rate 5-point Likert scale questions to configure the four unique student traits (B). TeachTune generates a natural language student profile overview (C) based on the information set in (B). Users can edit the system-generated description or add more contextual information about a student.
  • Figure 5: The Personalized Reflect-Respond pipeline. The pipeline interprets the student's trait values and creates a trait overview (1), and the previous conversation history is used to update the knowledge state through the Reflect pipeline (2). Afterward, the Respond pipeline takes the conversation, the updated knowledge state, and the trait overview to generate the response (3). The blue background marks the runtime area, where the components inside change throughout a conversation. The trait overview is created once, before runtime.
  • ...and 10 more figures