ChatGPT as a Solver and Grader of Programming Exams written in Spanish

Pablo Saborido-Fernández, Marcos Fernández-Pichel, David E. Losada

TL;DR

The paper addresses whether a leading LLM can both solve and grade real programming exam problems written in Spanish. It adopts an empirical setup using a May 2023 first-year CS programming exam (7 questions) and two prompting variants to evaluate solving performance, complemented by a grading experiment on five human solutions. Findings show ChatGPT handles basic coding tasks but falters on ADT specification and divide-and-conquer reasoning, and it tends to overestimate human solution quality when acting as a grader. The authors contribute a new Spanish-language programming task corpus and a set of prompts for solving and grading, enabling replication and further research. The work informs education technology by highlighting the need for human-in-the-loop systems and improved prompting or model capabilities for reliable automated assessment.

Abstract

Evaluating the capabilities of Large Language Models (LLMs) to assist teachers and students in educational tasks is receiving increasing attention. In this paper, we assess ChatGPT's capacities to solve and grade real programming exams, from an accredited BSc degree in Computer Science, written in Spanish. Our findings suggest that this AI model is only effective for solving simple coding tasks. Its proficiency in tackling complex problems or evaluating solutions authored by others is far from adequate. As part of this research, we also release a new corpus of programming tasks and the corresponding prompts for solving the problems or grading the solutions. This resource can be further exploited by other research teams.

Paper Structure

This paper contains 17 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: ChatGPT as a grader. For each exam solved by a student, the bars represent the score given by ChatGPT and the score given by the instructor of the course.
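
The comparison in Figure 1 could be summarized numerically along the following lines. This is a minimal sketch, not the authors' method: the function name `grading_bias` and the score values are hypothetical placeholders, not data from the paper. It computes the mean signed difference between the scores given by ChatGPT and by the instructor (a positive value would indicate overestimation) together with the mean absolute difference.

```python
from statistics import mean


def grading_bias(llm_scores, instructor_scores):
    """Return (mean signed difference, mean absolute difference).

    A positive signed difference means the LLM scored higher than
    the instructor on average.
    """
    if len(llm_scores) != len(instructor_scores):
        raise ValueError("score lists must have the same length")
    if not llm_scores:
        raise ValueError("score lists must not be empty")
    diffs = [llm - ins for llm, ins in zip(llm_scores, instructor_scores)]
    return mean(diffs), mean(abs(d) for d in diffs)


# Hypothetical placeholder scores for five graded exams.
# These values are illustrative only and are NOT taken from the paper.
llm = [8.0, 7.0, 9.0, 6.0, 7.0]
instructor = [6.0, 7.0, 8.0, 4.0, 5.0]

bias, mean_abs_diff = grading_bias(llm, instructor)
print(f"mean signed difference: {bias:.2f}")
print(f"mean absolute difference: {mean_abs_diff:.2f}")
```

With only five graded exams, as in the paper's grading experiment, such averages would be indicative at best and should be read alongside the per-exam bars.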