ChatGPT as a Solver and Grader of Programming Exams written in Spanish
Pablo Saborido-Fernández, Marcos Fernández-Pichel, David E. Losada
TL;DR
The paper addresses whether a leading LLM can both solve and grade real programming exam problems written in Spanish. It adopts an empirical setup using a May 2023 first-year CS programming exam (7 questions) and two prompting variants to evaluate solving performance, complemented by a grading experiment on five human solutions. Findings show that ChatGPT handles basic coding tasks but falters on ADT specification and divide-and-conquer reasoning, and that it tends to overestimate the quality of human solutions when acting as a grader. The authors contribute a new Spanish-language programming task corpus and a set of prompts for solving and grading, enabling replication and further research. The work informs education technology by highlighting the need for human-in-the-loop systems and for improved prompting or model capabilities before automated assessment can be considered reliable.
Abstract
Evaluating the capabilities of Large Language Models (LLMs) to assist teachers and students in educational tasks is receiving increasing attention. In this paper, we assess ChatGPT's capacity to solve and grade real programming exams, written in Spanish, from an accredited BSc degree in Computer Science. Our findings suggest that this AI model is only effective for solving simple coding tasks. It is far from proficient at tackling complex problems or evaluating solutions authored by others. As part of this research, we also release a new corpus of programming tasks and the corresponding prompts for solving the problems or grading the solutions. This resource can be further exploited by other research teams.
