A Survey Study on the State of the Art of Programming Exercise Generation using Large Language Models

Eduard Frankford, Ingo Höhn, Clemens Sauerwein, Ruth Breu

TL;DR

The paper tackles how large language models can be used to generate programming exercises for education and notes the lack of a focused survey in this area. It conducts a literature survey across 2018–2023, extracting state-of-the-art methods, strengths, and weaknesses, and proposes an evaluation matrix plus the Programming Exercise Generation Benchmark (PEGB) to guide assessment. Key findings show that several LLMs can produce sensible, novel exercises and that decomposition prompts improve results, but test-suite quality and the risk of LLMs easily solving exercises remain major concerns. The work offers a practical framework for educators and researchers to select suitable LLMs and to benchmark future systems, aiming to scale and personalize programming education while mitigating risks.

Abstract

This paper analyzes Large Language Models (LLMs) with regard to their programming exercise generation capabilities. Through a survey study, we defined the state of the art, extracted the strengths and weaknesses of current LLMs, and proposed an evaluation matrix that helps researchers and educators decide which LLM is best suited to the programming exercise generation use case. We also found that multiple LLMs are capable of producing useful programming exercises. Nevertheless, challenges remain, such as the ease with which LLMs might solve exercises generated by LLMs. This paper contributes to the ongoing discourse on the integration of LLMs in education.

Paper Structure (14 sections, 1 table)