Which Type of Students can LLMs Act? Investigating Authentic Simulation with Graph-based Human-AI Collaborative System

Haoxuan Li, Jifan Yu, Xin Cong, Yang Dang, Daniel Zhang-li, Lu Mi, Yisi Zhan, Huiqin Liu, Zhiyuan Liu

TL;DR

This work tackles the challenge of credibly simulating diverse student profiles for educational research by introducing a three-stage LLM-human pipeline that generates profiles, evaluates them with two rounds of automated scoring, and refines scores via graph-based propagation. It demonstrates that combining automated scoring with expert calibration yields simulations that better align with human judgments and analyzes which traits and interactions most influence realism. The authors also provide a dataset of simulated student profiles and interactions to support research in academic advising and personalized intervention. Overall, the approach offers a scalable framework for generating authentic educational data while highlighting trade-offs between automation and human oversight.

Abstract

Although rapid advances in large language models (LLMs) are reshaping data-driven intelligent education, accurately simulating students remains an important but challenging bottleneck for scalable educational data collection, evaluation, and intervention design. Current work is limited by scarce real interaction data, costly expert evaluation of realism, and a lack of large-scale, systematic analyses of LLMs' ability to simulate students. We address this gap by presenting a three-stage LLM-human collaborative pipeline that automatically generates and filters high-quality student agents. We use two rounds of automated scoring, validated by human experts, and deploy a score propagation module to obtain more consistent scores across the student similarity graph. Experiments show that combining automated scoring, expert calibration, and graph-based propagation yields simulated students that more closely track human judgments of authenticity. We then analyze which profiles and behaviors are simulated more faithfully, supporting subsequent studies on personalized learning and educational assessment.

Paper Structure

This paper contains 30 sections, 9 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: The distributions of the LLM scores (green) and human scores (orange) on the authenticity of simulated students.
  • Figure 2: Evaluation of Agent Student's Authenticity Using a Q&A. We use colored words to present traits from the profile in the dialogue. The LLM Scorer identified some inconsistencies but deemed the agent's explanations reasonable. However, human experts concluded that the Agent's behavior deviated from the profile (colored in red).
  • Figure 3: The pipeline automates the generation and selection of high-quality simulated student agents. It begins with random profile generation, followed by two rounds of automated scoring for profile and behavior consistency, partially validated by human experts. A graph module propagates scores across a student similarity graph constructed via a sentence encoder. Candidates are ranked and filtered based on propagated scores and finally selected by human experts through a real academic advising test.
  • Figure 4: Student profiles include demographic and other information. Details are provided in Appendix \ref{app:profile}.
  • Figure 5: The distributions of the initial (green), propagated (orange), and human scores (blue). Propagated scores are more closely aligned with the human scores.
  • ...and 4 more figures
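
The Figure 3 caption describes propagating scores across a student similarity graph built with a sentence encoder. The paper's actual propagation equations are not reproduced on this page, so the following is only a minimal sketch of one common way such a step could work: a k-nearest-neighbor cosine-similarity graph combined with a damped label-propagation update. The function names, the parameters `k`, `alpha`, and `num_iters`, and the toy embeddings are all assumptions made for illustration; they are not the authors' implementation.

```python
import numpy as np


def build_similarity_graph(embeddings, k=1):
    """Row-normalized kNN graph from cosine similarity (hypothetical helper).

    `embeddings` stands in for sentence-encoder outputs, one row per
    student profile. The choice of k is an assumption.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a node is never its own neighbor

    n = sim.shape[0]
    weights = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(sim[i])[-k:]  # indices of the k most similar nodes
        weights[i, neighbors] = np.clip(sim[i, neighbors], 0.0, None)
    weights = np.maximum(weights, weights.T)  # make the graph undirected

    # Nodes without any positive-weight edge get a self-loop so that
    # their score is simply preserved during propagation.
    isolated = np.flatnonzero(weights.sum(axis=1) == 0)
    weights[isolated, isolated] = 1.0
    return weights / weights.sum(axis=1, keepdims=True)


def propagate_scores(transition, initial_scores, alpha=0.5, num_iters=50):
    """Damped propagation: mix neighbor averages with the initial scores."""
    initial = np.asarray(initial_scores, dtype=float)
    scores = initial.copy()
    for _ in range(num_iters):
        scores = alpha * (transition @ scores) + (1.0 - alpha) * initial
    return scores


# Toy data (made up): two pairs of similar profiles with noisy initial scores.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_initial = np.array([5.0, 1.0, 4.0, 2.0])
toy_transition = build_similarity_graph(toy_embeddings, k=1)
toy_propagated = propagate_scores(toy_transition, toy_initial)
print(toy_propagated)
```

In this sketch, `alpha` controls how strongly a profile's score is pulled toward the scores of similar profiles. With `alpha=0` the initial scores are returned unchanged, and larger values smooth scores more aggressively across the graph.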