Generating Planning Feedback for Open-Ended Programming Exercises with LLMs

Mehmet Arif Demirtaş, Claire Zheng, Max Fowler, Kathryn Cunningham

TL;DR

This work tackles the challenge of giving students feedback on their planning process for open-ended programming exercises, beyond final test-case results. It proposes detecting high-level programming plans from final code submissions using LLMs, evaluating both a large model (GPT-4o) and a smaller, cost-effective variant (GPT-4o-mini) on a CS1 dataset. With both few-shot prompting and fine-tuning, LLMs outperform AST-based and embedding baselines in plan classification, with micro-F1 scores approaching 0.78, particularly for the fine-tuned GPT-4o-mini. The results indicate that LLMs can provide real-time planning feedback, even for syntactically incorrect code, and suggest broader applicability to other domains where problem solving proceeds via recognizable high-level plans.

Abstract

To complete an open-ended programming exercise, students need to both plan a high-level solution and implement it using the appropriate syntax. However, these problems are often autograded on the correctness of the final submission through test cases, and students cannot get feedback on their planning process. Large language models (LLMs) may be able to generate this feedback by detecting the overall code structure even for submissions with syntax errors. To this end, we propose an approach that uses LLMs to detect which high-level goals and patterns (i.e., programming plans) exist in a student program. We show that both the full GPT-4o model and a small variant (GPT-4o-mini) can detect these plans with remarkable accuracy, outperforming baselines inspired by conventional approaches to code analysis. We further show that the smaller, cost-effective variant (GPT-4o-mini) achieves results on par with the state-of-the-art model (GPT-4o) after fine-tuning, creating promising implications for the use of smaller models in real-time grading. These smaller models can be incorporated into autograders for open-ended code-writing exercises to provide feedback on students' implicit planning skills, even when their program is syntactically incorrect. Furthermore, LLMs may be useful in providing feedback for problems in other domains where students start with a set of high-level solution steps and iteratively compute the output, such as math and physics problems.

Paper Structure

This paper contains 16 sections, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: F1 scores for each of the ten programming plans. Models perform noticeably worse at classifying solutions that are labeled as UNKNOWN.
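
For readers unfamiliar with the metrics named above, the following is a minimal sketch of how per-plan F1 and micro-F1 could be computed from gold and predicted plan labels. It assumes each submission carries exactly one plan label, which this summary does not confirm. The function names and the plan labels SUM and COUNT are hypothetical; only UNKNOWN appears in the source. This is not the authors' evaluation code.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def plan_f1_scores(gold, predicted):
    """Return (per-plan F1 dict, micro-F1) for single-label predictions.

    Assumption: one plan label per submission (not confirmed by the source).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted plan was wrong
            fn[g] += 1  # gold plan was missed
    labels = sorted(set(gold) | set(predicted))
    per_plan = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    # Micro-F1 pools the counts over all plans before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_plan, micro


# Hypothetical labels for illustration only.
gold = ["SUM", "SUM", "COUNT", "UNKNOWN"]
predicted = ["SUM", "COUNT", "COUNT", "SUM"]
per_plan, micro = plan_f1_scores(gold, predicted)
print(per_plan, micro)
```

Under the single-label assumption, micro-F1 equals plain accuracy, so a class that is rarely predicted correctly (as the figure reports for UNKNOWN) shows up clearly in the per-plan scores while being diluted in the pooled number.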