Large Language Models for In-Context Student Modeling: Synthesizing Student's Behavior in Visual Programming

Manh Hung Nguyen, Sebastian Tschiatschek, Adish Singla

TL;DR

The paper addresses student modeling in open-ended domains: given a student's observed attempt on a reference task, the goal is to synthesize that student's attempt on a target task. It introduces LLM-SS, a perturbation-based, in-context learning framework that uses domain-specific fine-tuning to inject expert knowledge and infer student misconceptions, producing synthesized attempts $\widehat{C}^{\textsc{stu}}_{T^{\textnormal{tar}}}$. The formal setup is a two-step process (observe the reference-task behavior, then synthesize the target-task attempt) over a task space $\mathbb{T}$ and a code space $\mathbb{C}$, evaluated with a quality rubric consisting of $Q_{\text{stu}}$, $Q_{\text{task}}$, and $Q_{\text{overall}} = Q_{\text{stu}} \times Q_{\text{task}}$, and demonstrated on the HoCMaze/StudentSyn benchmark. Experiments show that fine-tuned LLMs substantially improve synthesis quality over the NeurSS baseline and, in some cases, approach human tutor performance, which suggests the framework can scale in-context student modeling without heavy training pipelines.
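As a minimal sketch of how the rubric combines, the snippet below computes $Q_{\text{overall}}$ as the product of $Q_{\text{stu}}$ and $Q_{\text{task}}$. The class name `RubricScore`, the helper `mean_overall`, the numeric scale of the scores, and the averaging step are assumptions made for illustration; the summary only states the product relation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RubricScore:
    """Rubric scores for one synthesized attempt (numeric scale is an assumption)."""

    q_stu: float   # how well the attempt captures the student's behavior
    q_task: float  # how well the attempt fits the target task

    @property
    def q_overall(self) -> float:
        # Q_overall = Q_stu * Q_task, the relation stated in the TL;DR.
        return self.q_stu * self.q_task


def mean_overall(scores: list[RubricScore]) -> float:
    """Average Q_overall over several scenarios (hypothetical aggregation)."""
    return mean(s.q_overall for s in scores)
```

Because the overall score is a product, an attempt that misses either the student's behavior or the target task scores zero overall, regardless of the other attribute.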

Abstract

Student modeling is central to many educational technologies as it enables predicting future learning outcomes and designing targeted instructional strategies. However, open-ended learning domains pose challenges for accurately modeling students due to diverse student behaviors and a large space of possible misconceptions. To address these challenges, we explore the application of large language models (LLMs) for in-context student modeling in open-ended learning domains. More concretely, given a particular student's attempt on a reference task as observation, the objective is to synthesize the student's attempt on a target task. We introduce a novel framework, LLM for Student Synthesis (LLM-SS), that leverages LLMs for synthesizing a student's behavior. Our framework can be combined with different LLMs; moreover, we fine-tune LLMs to boost their student modeling capabilities. We instantiate several methods based on the LLM-SS framework and evaluate them using an existing benchmark, StudentSyn, for student attempt synthesis in a visual programming domain. Experimental results show that our methods perform significantly better than the baseline method NeurSS provided in the StudentSyn benchmark. Furthermore, our method using a fine-tuned version of the GPT-3.5 model performs significantly better than the same method using the base GPT-3.5 model and gets close to human tutors' performance.

Paper Structure (13 sections, 6 figures)

Figures (6)

  • Figure 1: Illustration of our problem setup in a visual programming environment. The scenario is taken from the StudentSyn benchmark (arXiv:2205.01265). A synthesizer observes a tuple ($T^{\textnormal{ref}}$, $C^*_{T^{\textnormal{ref}}}$, $C^{\textsc{stu}}_{T^{\textnormal{ref}}}$) indicating a student $\textsc{stu}$'s behavior. Then, given a target task $T^{\textnormal{tar}}$ along with a solution $C^*_{T^{\textnormal{tar}}}$, the synthesizer generates a student's attempt $\widehat{C}^{\textsc{stu}}_{T^{\textnormal{tar}}}$ that imitates the student's behavior.
  • Figure 2: Prompt template used in the LLM-SS framework. {placeholders} are used to include details for each scenario.
  • Figure 3: Fine-tuning an LLM using expert knowledge in the LLM-SS framework.
  • Figure 4: (a) shows the performance of each method w.r.t. individual attributes in our quality rubric. (b) shows the overall performance in capturing both the student's behavior and the target task's characteristics. Human tutors (TutorSS) serve as an oracle. For methods using a fine-tuned LLM, we report numbers averaged over three fine-tuning runs with standard errors (except GPT-3.5ft-SS, which has only one run due to the high cost of OpenAI's fine-tuning APIs).
  • Figure 5: Losses and evaluations during fine-tuning of our two best-performing methods, GPT-3.5ft-SS and Llama2-70Bft-SS. We plot data per 0.1 epoch. Losses are plotted on a log scale for better visibility of the dynamics. Validation BLEU/accuracy metrics are chosen by the fine-tuning library/platform and shown as a sanity check; they are not used for optimization. For GPT-3.5ft-SS, the number of epochs depends on the budget spent on OpenAI APIs; we spent roughly half of the total budget on each task. For Llama2-70Bft-SS, the number of epochs is determined by generative performance on a small validation set of examples.
  • ...and 1 more figures
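
The setup in Figure 1 and the {placeholders} mechanism in Figure 2 can be sketched as follows. This is an illustrative sketch only: the `Scenario` class, its field names, the `build_prompt` helper, and the template wording are all hypothetical, since the actual prompt text from Figure 2 is not reproduced here. The sketch only shows how one scenario's details could be substituted into a fixed template before the prompt is sent to an LLM.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Scenario:
    """One synthesis scenario, following the tuple described in Figure 1."""

    ref_task: str             # T^ref: the reference task
    ref_solution: str         # C*_{T^ref}: a solution to the reference task
    ref_student_attempt: str  # C^stu_{T^ref}: the student's observed attempt
    tar_task: str             # T^tar: the target task
    tar_solution: str         # C*_{T^tar}: a solution to the target task


# Hypothetical template wording; the real template is the one shown in Figure 2.
PROMPT_TEMPLATE = (
    "Reference task:\n{ref_task}\n\n"
    "Solution to the reference task:\n{ref_solution}\n\n"
    "Student's attempt on the reference task:\n{ref_student_attempt}\n\n"
    "Target task:\n{tar_task}\n\n"
    "Solution to the target task:\n{tar_solution}\n\n"
    "Write the attempt this student would likely make on the target task."
)


def build_prompt(scenario: Scenario) -> str:
    """Fill every {placeholder} in the template with the scenario's details."""
    return PROMPT_TEMPLATE.format(**asdict(scenario))
```

The LLM's completion for such a prompt would play the role of the synthesized attempt $\widehat{C}^{\textsc{stu}}_{T^{\textnormal{tar}}}$; the call to the model itself is left out because it depends on the chosen LLM and API.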