Can LLMs Identify Gaps and Misconceptions in Students' Code Explanations?
Priti Oli, Rabin Banjade, Andrew M. Olney, Vasile Rus
TL;DR
This paper investigates how Large Language Models (LLMs) can identify gaps and misconceptions in students' self-explanations of code. It compares zero-shot prompting, supervised fine-tuning (SFT), and preference alignment with Odds Ratio Preference Optimization (ORPO) across GPT-4, LLaMA3, and Mistral. Using the DeepCode Java dataset (with Python-explained variants) and augmented explanations, the study finds that GPT-4 performs best among the models under zero-shot prompting, while fine-tuned models aligned with ORPO produce the strongest and most diagnostic feedback. Employing SFT and ORPO substantially improves feedback quality and reduces hallucinations, highlighting the potential of LLMs for automated assessment of student-generated explanations in programming. The findings offer guidance for building scalable, personalized feedback tools for code comprehension and set the stage for further work on robustness and generalization in educational NLP systems.
Abstract
This paper investigates several approaches that use Large Language Models (LLMs) to identify gaps and misconceptions in students' self-explanations of specific instructional material, in our case explanations of code examples. This research is part of our larger effort to automate the assessment of students' freely generated responses, focusing specifically on their self-explanations of code examples during code comprehension activities. In this work, we experiment with zero-shot prompting, Supervised Fine-Tuning (SFT), and preference alignment of LLMs to identify gaps in students' self-explanations. With simple prompting, GPT-4 consistently outperformed LLaMA3 and Mistral in identifying gaps and misconceptions, as confirmed by human evaluations. Our results also suggest that fine-tuned LLMs are more effective at identifying gaps in students' explanations than zero-shot and few-shot prompting techniques. Finally, our findings show that preference optimization using Odds Ratio Preference Optimization (ORPO) outperforms SFT in identifying gaps and misconceptions in students' code explanations.
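For readers unfamiliar with ORPO, the following minimal sketch illustrates the general shape of its objective as commonly formulated: an SFT term on the preferred response plus a weighted odds-ratio term that favors the preferred response over the rejected one. It is not the training code used in this work. The function names, the weight `lam`, and the example log-probabilities are illustrative assumptions, and the inputs are assumed to be mean per-token log-probabilities (strictly negative) that a model assigns to each response.

```python
import math


def log_odds(avg_logprob: float) -> float:
    """Log-odds of a response, given its mean per-token log-probability.

    Assumes avg_logprob < 0, so that p = exp(avg_logprob) lies in (0, 1).
    """
    p = math.exp(avg_logprob)
    # log(p / (1 - p)) = log p - log(1 - p)
    return avg_logprob - math.log1p(-p)


def orpo_loss(chosen_avg_logprob: float,
              rejected_avg_logprob: float,
              lam: float = 0.1) -> float:
    """Sketch of an ORPO-style loss for one (chosen, rejected) pair.

    lam is a hypothetical weight on the odds-ratio term.
    """
    # SFT term: negative log-likelihood of the preferred response.
    sft_term = -chosen_avg_logprob
    # Odds-ratio term: -log sigmoid(log-odds margin) = log(1 + exp(-margin)).
    margin = log_odds(chosen_avg_logprob) - log_odds(rejected_avg_logprob)
    odds_ratio_term = math.log1p(math.exp(-margin))
    return sft_term + lam * odds_ratio_term


# Hypothetical values: the model already prefers the better feedback...
loss_good = orpo_loss(chosen_avg_logprob=-0.2, rejected_avg_logprob=-1.5)
# ...versus a model that prefers the worse feedback.
loss_bad = orpo_loss(chosen_avg_logprob=-1.5, rejected_avg_logprob=-0.2)
print(loss_good < loss_bad)
```

In this sketch the loss is lower when the model assigns higher likelihood to the preferred feedback than to the rejected feedback, which is the behavior preference alignment is meant to encourage.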
