EduPersona: Benchmarking Subjective Ability Boundaries of Virtual Student Agents

Buyuan Zhu, Shiyu Hu, Yiping Ma, Yuanming Zhang, Kang Hao Cheong

TL;DR

EduPersona addresses the gap in evaluating subjective classroom abilities of virtual student agents by introducing a cross-lingual, cross-subject benchmark grounded in Big Five personality theory. It decomposes subjective performance into three progressive tasks—basic coherence, student realism, and persona consistency—and validates the framework with three open-source LLMs and ten persona-finetuned variants, showing substantial improvements across tasks. The dataset combines 1,308 authentic classroom rounds with a tenfold persona stylization to create ~128k turns, enabling robust evaluation and analysis of behavioral alignment, authenticity, and long-term stability. The work emphasizes that subjective abilities do not scale monotonically with model size, highlights persistent bottlenecks in High Conscientiousness and High Openness personas, and commits to open-sourcing the data and framework to advance trustworthy, human-like AI in education.

Abstract

As large language models are increasingly integrated into education, virtual student agents are becoming vital for classroom simulation and teacher training. Yet their classroom-oriented subjective abilities remain largely unassessed, limiting understanding of model boundaries and hindering trustworthy deployment. We present EduPersona, a large-scale benchmark spanning two languages, three subjects, and ten persona types based on the Big Five theory. The dataset contains 1,308 authentic classroom dialogue rounds, corresponding to 12,814 teacher-student Q&A turns, and is further expanded through persona stylization to roughly ten times that scale (about 128k turns), providing a solid foundation for evaluation. Building on this resource, we decompose hard-to-quantify subjective performance into three progressive tasks: TASK1 basic coherence (whether behavior, emotion, expression, and voice align with classroom context), TASK2 student realism, and TASK3 long-term persona consistency, thereby establishing an evaluation framework grounded in educational theory and research value. We conduct systematic experiments on three representative LLMs, comparing their original versions with ten persona-fine-tuned variants trained on EduPersona. Results show consistent and significant average improvements across all tasks: TASK1 +33.6%, TASK2 +30.6%, and TASK3 +14.9%. These improvements highlight the dataset's effectiveness and research value, while also revealing the heterogeneous difficulty of persona modeling. In summary, EduPersona delivers the first classroom benchmark centered on subjective abilities and establishes a decoupled and verifiable research paradigm. We will open-source both the dataset and the framework to support the broader research community in advancing trustworthy and human-like AI for education.
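The dataset scale quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only and rests on two assumptions not stated outright in the abstract: that the ten persona types are the high and low poles of each Big Five trait, and that every authentic turn is stylized once per persona. The names `BIG_FIVE_TRAITS`, `LEVELS`, and `persona_types` are hypothetical and not part of the EduPersona release.

```python
from itertools import product

# Assumption: ten persona types = {High, Low} x five Big Five traits.
BIG_FIVE_TRAITS = [
    "Openness",
    "Conscientiousness",
    "Extraversion",
    "Agreeableness",
    "Neuroticism",
]
LEVELS = ["High", "Low"]

# Figures taken from the abstract.
AUTHENTIC_ROUNDS = 1308
AUTHENTIC_TURNS = 12814


def persona_types():
    """Enumerate the assumed persona labels, e.g. 'High Openness'."""
    return [f"{level} {trait}" for trait, level in product(BIG_FIVE_TRAITS, LEVELS)]


personas = persona_types()

# Assumption: each authentic turn is stylized once per persona.
stylized_turns = AUTHENTIC_TURNS * len(personas)

# Average number of Q&A turns per authentic classroom round.
turns_per_round = AUTHENTIC_TURNS / AUTHENTIC_ROUNDS

print(len(personas))              # number of persona types
print(stylized_turns)             # stylized turns under the assumptions above
print(round(turns_per_round, 1))  # average turns per round
```

Under these assumptions the product is 128,140 turns, consistent with the "128k" figure in the abstract, and each authentic round averages just under ten Q&A turns.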

Paper Structure

This paper contains 29 sections, 5 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Workflow Overview of EduPersona. It consists of three steps: (i) dataset construction (cross-subject and cross-lingual classroom dialogues with persona expansion and multimodal labeling, Sec. \ref{sec:dataset}); (ii) a three-task evaluation framework (covering coherence, realism, and consistency, Sec. \ref{sec:evaluation}); and (iii) systematic experiments and analysis (comparing original and fine-tuned models with cross-model comparisons and case studies, Sec. \ref{sec:experiments}). Together, these steps establish the first classroom benchmark focused on subjective abilities, systematically outlining the capability boundaries of virtual student agents.
  • Figure 2: Chinese classroom example with persona-conditioned responses. The top panel shows a real IRF snippet (with English translation), and the bottom presents virtual-student outputs under high/low extraversion with behavior–expression labels. This illustrates the EduPersona pipeline (raw dialogue $\rightarrow$ persona stylization $\rightarrow$ behavior–expression labeling) and highlights how different personas yield linguistic and non-verbal differences within the same teaching context.
  • Figure 3: Cross-subject and persona linguistic variation. The word clouds show high-frequency token distributions across Chinese, Math, and English under high/low persona settings, revealing distinct lexical preferences and expression patterns that provide linguistic features for evaluating student realism and persona consistency.
  • Figure 4: Vocabulary richness across subjects and personas. The heatmap shows virtual students’ vocabulary coverage across three subjects and persona settings, indicating that both factors significantly shape lexical diversity.
  • Figure 5: Impact of persona fine-tuning on basic coherence across five metrics. Persona fine-tuning consistently improves basic coherence across all models, demonstrating the value of EduPersona. Fine-tuned Qwen and DeepSeek achieve OverallAcc above 0.62 with strong label alignment, while InternLM3 also benefits but remains constrained by a low response rate.
  • ...and 5 more figures