Debugging Without Error Messages: How LLM Prompting Strategy Affects Programming Error Explanation Effectiveness
Audrey Salmon, Katie Hammer, Eddie Antonio Santos, Brett A. Becker
TL;DR
The paper evaluates how well GPT-3.5 can explain programming errors when the original error messages are omitted from the prompt. The authors collect erroneous TigerJython programs, write manual explanations for them, and compare three strategies: baseline prompting, one-shot prompting, and fine-tuning. Without error messages, the model produces about 2–3 useful explanations for every misleading one. The choice of strategy does not significantly improve accuracy, though fine-tuning yields shorter, more on-topic explanations with no extraneous content. The findings suggest that source-code context is crucial for effective feedback and that GenAI can support novice debugging without relying on traditional programming error messages (PEMs). The work offers guidance for educators on using LLMs to improve programming error feedback in classrooms.
Abstract
Making errors is part of the programming process -- even for the most seasoned professionals. Novices in particular are bound to make many errors while learning. It is well known that traditional (compiler/interpreter) programming error messages have been less than helpful for many novices: they can be frustrating, contain confusing jargon, and be downright misleading. Recent work has found that large language models (LLMs) can generate excellent error explanations, but that the effectiveness of these explanations depends heavily on whether the LLM has been provided with context -- typically the original source code where the problem occurred. Knowing that programming error messages can be misleading and/or contain information that serves little-to-no use (particularly for novices), we explore the reverse: what happens when GPT-3.5 is prompted for error explanations on just the erroneous source code itself, with the original compiler/interpreter-produced error message excluded. We used various strategies to produce more effective error explanations, including one-shot prompting and fine-tuning. We report baseline results on how effective the error explanations are at providing feedback, as well as how various prompting strategies might improve the explanations' effectiveness. Our results can help educators understand how LLMs respond to the kinds of prompts that novices are bound to make, and will hopefully lead to more effective use of Generative AI in the classroom.
