Evaluating Large Language Models for automatic analysis of teacher simulations
David de-Fitero-Dominguez, Mariano Albaladejo-González, Antonio Garcia-Cabot, Eva Garcia-Lopez, Antonio Moreno-Cediel, Erin Barno, Justin Reich
TL;DR
This study evaluates large language models (LLMs) for the automatic analysis of teacher simulations, comparing DeBERTaV3 and Llama 3 across zero-shot, few-shot, and fine-tuning configurations on 14 predefined characteristics derived from a teacher-education digital simulation (DS). Using a dataset of 4,822 labeled response-characteristic pairs, the authors find that performance varies substantially by characteristic. Llama 3, fine-tuned and prompted with few-shot examples, generalizes best to unseen characteristics, while DeBERTaV3 excels on characteristics seen during training. The findings guide DS designers on model selection and prompting strategies, and highlight both the potential and the limitations of LLMs for scalable, automatic feedback in teacher simulations. The work also outlines future directions: cross-simulation generalization, analysis of difficult characteristics, and educator adoption of LLM-assisted evaluation.
Abstract
Digital Simulations (DS) provide safe environments where users interact with an agent through conversational prompts, offering engaging learning experiences that can be used to train teacher candidates in realistic classroom scenarios. These simulations usually include open-ended questions, which allow teacher candidates to express their thoughts but complicate automatic response analysis. To address this issue, we evaluated Large Language Models (LLMs) for identifying characteristics (user behaviors) in the responses to DS for teacher education. We evaluated the performance of DeBERTaV3 and Llama 3, combined with zero-shot, few-shot, and fine-tuning. Our experiments revealed significant variation in the LLMs' performance depending on the characteristic to identify. Additionally, we noted that DeBERTaV3's performance dropped significantly when it had to identify new characteristics. In contrast, Llama 3 performed better than DeBERTaV3 in detecting new characteristics and showed more stable performance. Therefore, in DS where teacher educators need to introduce new characteristics, because these change depending on the simulation or the educational objectives, Llama 3 is the recommended choice. These results can guide other researchers in introducing LLMs to provide the highly demanded automatic evaluations in DS.
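
To make the difference between the zero-shot and few-shot settings concrete, the following minimal sketch shows how a prompt for detecting one characteristic in one response might be assembled. It is an illustration only: the instruction wording, the function name build_prompt, and the example characteristic and responses are all assumptions made for this sketch, not the prompts or data used in the study.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text; the wording used in the study is not given here.
INSTRUCTION = (
    "Decide whether the teacher candidate's response shows the "
    "characteristic described below. Answer 'yes' or 'no'."
)


def build_prompt(
    characteristic: str,
    response: str,
    examples: Sequence[Tuple[str, bool]] = (),
) -> str:
    """Return a zero-shot prompt (no examples) or a few-shot prompt.

    `examples` holds (response_text, has_characteristic) pairs that are
    shown to the model as labeled demonstrations before the target response.
    """
    lines = [INSTRUCTION, f"Characteristic: {characteristic}", ""]
    for text, label in examples:
        lines.append(f"Response: {text}")
        lines.append(f"Answer: {'yes' if label else 'no'}")
        lines.append("")
    # The target response comes last, with the answer left for the model.
    lines.append(f"Response: {response}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented characteristic and responses, for illustration only.
    characteristic = "The candidate asks the student a follow-up question."
    target = "Can you tell me more about how you got that answer?"
    demos = [
        ("What made you choose that method?", True),
        ("That is wrong. The answer is 12.", False),
    ]
    print(build_prompt(characteristic, target))         # zero-shot
    print(build_prompt(characteristic, target, demos))  # few-shot
```

In this sketch a new characteristic only requires a new description string and, optionally, a few labeled demonstrations, which is one way to read the abstract's point about introducing characteristics that a fine-tuned classifier has not seen.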
