HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents

Yilin Jiang, Fei Tan, Xuanyu Yin, Jing Leng, Aimin Zhou

TL;DR

HACHIMI is a multi-agent Propose-Validate-Revise framework that generates theory-aligned, quota-controlled student personas and provides a standardized synthetic student population for group-level benchmarking and social-science simulations.

Abstract

Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad-hoc prompting or hand-crafted profiles with limited control over educational theory and population distributions. We formalize this as Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) and introduce HACHIMI, a multi-agent Propose-Validate-Revise framework that generates theory-aligned, quota-controlled personas. HACHIMI factorizes each persona into a theory-anchored educational schema, enforces developmental and psychological constraints via a neuro-symbolic validator, and combines stratified sampling with semantic deduplication to reduce mode collapse. The resulting HACHIMI-1M corpus comprises 1 million personas for Grades 1-12. Intrinsic evaluation shows near-perfect schema validity, accurate quotas, and substantial diversity, while external evaluation instantiates personas as student agents answering CEPS and PISA 2022 surveys; across 16 cohorts, math and curiosity/growth constructs align strongly between humans and agents, whereas classroom-climate and well-being constructs are only moderately aligned, revealing a fidelity gradient. All personas are generated with Qwen2.5-72B, and HACHIMI provides a standardized synthetic student population for group-level benchmarking and social-science simulations. Resources available at https://github.com/ZeroLoss-Lab/HACHIMI
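
The Propose-Validate-Revise loop and quota control described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `allocate_quotas`, `propose`, `validate`, `revise`, and `generate` are hypothetical names, the proposer is a random stub standing in for an LLM, and the age-for-grade rule is an invented example of a developmental constraint. Semantic deduplication is omitted.

```python
import random
from collections import Counter

def allocate_quotas(total, shares):
    """Largest-remainder allocation: per-stratum counts sum exactly to total."""
    raw = {k: total * s for k, s in shares.items()}
    counts = {k: int(v) for k, v in raw.items()}
    leftover = total - sum(counts.values())
    # Give the remaining slots to the strata with the largest fractional parts.
    for k in sorted(raw, key=lambda k: raw[k] - counts[k], reverse=True)[:leftover]:
        counts[k] += 1
    return counts

def propose(stratum, rng):
    """Stand-in for an LLM proposer (assumption): draft a persona for a stratum."""
    grade, level = stratum
    # The age is deliberately noisy so the validator has something to reject.
    return {"grade": grade, "level": level, "age": grade + rng.choice([4, 5, 6, 7, 8])}

def validate(persona):
    """Hypothetical symbolic rule: the age must be plausible for the grade."""
    lo, hi = persona["grade"] + 5, persona["grade"] + 7
    return [] if lo <= persona["age"] <= hi else [f"age must be in [{lo}, {hi}]"]

def revise(persona, errors):
    """Stand-in reviser (assumption): clamp the age into the allowed band."""
    lo, hi = persona["grade"] + 5, persona["grade"] + 7
    return {**persona, "age": min(max(persona["age"], lo), hi)}

def generate(quotas, seed=0, max_rounds=3):
    """Fill every stratum to its quota, revising drafts that fail validation."""
    rng = random.Random(seed)
    corpus = []
    for stratum, n in quotas.items():
        for _ in range(n):
            persona = propose(stratum, rng)
            for _ in range(max_rounds):
                errors = validate(persona)
                if not errors:
                    break
                persona = revise(persona, errors)
            if not validate(persona):  # keep only personas that pass
                corpus.append(persona)
    return corpus

# Toy target distribution: uniform over Grades 1-12 and three academic levels.
shares = {(g, lvl): 1 / 36 for g in range(1, 13) for lvl in ("low", "mid", "high")}
quotas = allocate_quotas(120, shares)
corpus = generate(quotas)
realized = Counter((p["grade"], p["level"]) for p in corpus)
print(realized == Counter(quotas))
```

Fixing the per-stratum counts before any persona is drafted is what makes the distribution controllable here: the generator never samples a stratum, it only fills one.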

Paper Structure (63 sections, 4 figures, 1 table, 1 algorithm)

Figures (4)

  • Figure 1: HACHIMI pipeline overview. From target distributions (grade/gender/academic level), steps (1)-(5) produce the HACHIMI-1M corpus.
  • Figure 2: Immersive role-playing prompt template used for HACHIMI student agents when answering CEPS- and PISA-based shadow surveys.
  • Figure 3: Pearson $r$ and Spearman $\rho$ between human and HACHIMI cohort means for each CEPS target.
  • Figure 4: Distribution of Pearson correlations between human and agent group means on PISA 2022, summarized by region.
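
Figures 3 and 4 report Pearson $r$ and Spearman $\rho$ between human and agent group means. As a rough illustration of how such cohort-level agreement might be computed, the sketch below implements both coefficients with the standard library only. The function names and the toy cohort means are assumptions made for illustration; they are not values or code from the paper.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))

# Made-up cohort means for one construct (one entry per cohort).
human_means = [2.1, 2.8, 3.0, 3.6]
agent_means = [2.0, 2.5, 3.4, 3.3]
print(round(pearson(human_means, agent_means), 3))
print(round(spearman(human_means, agent_means), 3))
```

Because the correlation is taken over cohort means rather than individual responses, it measures whether agents reproduce the ordering and spacing of groups, not whether any single agent matches a single student.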