Can LLMs Reliably Simulate Human Learner Actions? A Simulation Authoring Framework for Open-Ended Learning Environments

Amogh Mannekote; Adam Davies; Jina Kang; Kristy Elizabeth Boyer

TL;DR

Testing the Hyp-Mix simulation authoring framework in a physics learning environment, the authors find that GPT-4 Turbo maintains calibrated behavior even as the underlying learner model changes, providing the first evidence that LLMs can simulate realistic behaviors in open-ended interactive learning environments, a necessary prerequisite for useful LLM behavioral simulation.

Abstract

Simulating learner actions helps stress-test open-ended interactive learning environments and prototype new adaptations before deployment. While recent studies show the promise of using large language models (LLMs) for simulating human behavior, such approaches have not gone beyond rudimentary proof-of-concept stages due to key limitations. First, LLMs are highly sensitive to minor prompt variations, raising doubts about their ability to generalize to new scenarios without extensive prompt engineering. Moreover, apparently successful outcomes can often be unreliable, either because domain experts unintentionally guide LLMs to produce expected results, leading to self-fulfilling prophecies; or because the LLM has encountered highly similar scenarios in its training data, meaning that models may not be simulating behavior so much as regurgitating memorized content. To address these challenges, we propose Hyp-Mix, a simulation authoring framework that allows experts to develop and evaluate simulations by combining testable hypotheses about learner behavior. Testing this framework in a physics learning environment, we found that GPT-4 Turbo maintains calibrated behavior even as the underlying learner model changes, providing the first evidence that LLMs can be used to simulate realistic behaviors in open-ended interactive learning environments, a necessary prerequisite for useful LLM behavioral simulation.

Paper Structure (40 sections, 4 figures, 3 tables)

Introduction
Related Work
Simulated Learner Behavior for Authoring Educational Technologies
Simulating Human Behavior with LLMs
Prompt Sensitivity and Prompt Calibration
The Hyp-Mix Framework
mdh for Simulation Evaluation
mdh for Simulation Authoring
Achieving Mix-and-Match Simulation Authoring with mdh
Existing Notions of "Calibration"
Holding Calibration
Hypothesis Classes
Template Calibration
Experiments
Learning Environment
...and 25 more sections

Figures (4)

Figure 1: We characterize the effort involved in authoring LLM-based simulations of learner behavior as a function of two key attributes of the simulation authoring process: 1) prompt sensitivity and 2) the extent of environment-specific handcrafting required during development. High prompt sensitivity necessitates excessive editing for minor phrasing changes, thus consuming valuable expert time. On the other hand, the need for environment-specific handcrafting arises when an LLM struggles to generalize across learning environments, impeding rapid iteration. The proposed approach of mixing-and-matching expert-hypotheses to define simulation behavior offers a promising balance, enabling authors to impose necessary constraints while leveraging the advantages of state-of-the-art knowledge and reasoning capabilities of LLMs for "filling in the gaps."
Figure 2: A screenshot of the original HoloOrbits environment [2024.EDM-short-papers.56] with the keypoints annotated.
Figure 3: The Learner Model Edit Graph used in our experiments to evaluate LLM robustness across five distinct edit operations to the learner model. Each node represents a "snapshot" of the learner model after specific edits by the developer. Inside each node, the mdh comprising the learner model snapshot are listed. Green nodes indicate calibrated snapshots, while yellow nodes represent states untested for calibration. Each mdh in the learner model is annotated with a superscript: '?' for untested calibration status and '*' for confirmed calibration. (1) Ex-Situ Transfer: Tests if an mdh that is calibrated alongside other mdh remains calibrated when tested alone. (2) Combine Hypotheses: Assesses if two separately calibrated hypotheses remain stable when combined. (3) Variable Swap: Involves swapping a single variable within a hypothesis. (4) LC Swap: Evaluates if a prompt template calibrated for one learner characteristic works for another in the same class. (5) Calibration Regression: Tests if a calibrated hypothesis remains stable when a new hypothesis is added to the model.
Figure 4: This figure depicts the hierarchical composition of the learner simulation prompt template, $\hat{I}_\text{sim}$, which integrates global fragments ($\hat{I}_\text{global}$), environment descriptions ($\hat{I}_\text{environment}$), and learner persona values ($\hat{I}_\text{learner}$) to provide contextual grounding. The template also includes Learner Characteristic (LC) Models, $\hat{I}_\text{LC}(\mathcal{M})$, which are parameterized to simulate responses under different hypotheses, $H_{i,j}$, evaluated within individual LC models ($M_1$, $M_2$). These components collectively facilitate the generation of contextually appropriate actions in the simulation, reflecting the interplay between the environment and the learner's characteristics.
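
The hierarchical composition described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the class `LCModel`, the function `build_sim_prompt`, the example hypothesis wording, and the placeholder fragment texts are all hypothetical names and values chosen for illustration. The sketch only assumes that a simulation prompt is assembled from a global fragment, an environment description, learner persona values, and one rendered fragment per learner characteristic (LC) model.

```python
from dataclasses import dataclass, field


@dataclass
class LCModel:
    """Hypothetical LC model: a characteristic name plus hypothesis templates."""
    name: str
    hypotheses: list[str] = field(default_factory=list)

    def render(self, level: str) -> str:
        # Fill the learner's level for this characteristic into each hypothesis.
        lines = [h.format(lc=self.name, level=level) for h in self.hypotheses]
        return "\n".join(f"- {line}" for line in lines)


def build_sim_prompt(global_text, environment_text, persona, lc_models):
    """Assemble the prompt from global, environment, persona, and LC fragments."""
    persona_text = ", ".join(f"{k}: {v}" for k, v in persona.items())
    lc_text = "\n".join(m.render(persona[m.name]) for m in lc_models)
    sections = [
        global_text,
        environment_text,
        f"Learner persona: {persona_text}",
        lc_text,
    ]
    return "\n\n".join(sections)


# Illustrative values only; none of these strings come from the paper.
persistence = LCModel(
    name="persistence",
    hypotheses=["A learner with {level} {lc} tends to retry after a failed attempt."],
)
prompt = build_sim_prompt(
    global_text="You are simulating a learner's next action.",
    environment_text="Environment: a physics learning environment about orbits.",
    persona={"persistence": "high"},
    lc_models=[persistence],
)
print(prompt)
```

Keeping each LC model as its own renderable unit is one way to let hypotheses be added, removed, or swapped without touching the global and environment fragments, which is the kind of edit the Figure 3 operations describe.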
