Comparison of Three Programming Error Measures for Explaining Variability in CS1 Grades

Valdemar Švábenský, Maciej Pankiewicz, Jiayi Zhang, Elizabeth B. Cloude, Ryan S. Baker, Eric Fouh

TL;DR

The paper addresses the limited explanatory power of traditional outcome measures for CS1 performance by introducing process-based error metrics captured in an online IDE. It systematically compares three measures, Error Count (EC), Jadud's Error Quotient (EQ), and Repeated Error Density (RED), across compiler and runtime errors, using data from 280 novice Java students over two exams. The results show that EQ generally explains the most grade variability, and that runtime errors add predictive value for the later exam. None of the measures accounts for most of the observed variance: $R^2$ values remain modest, e.g., up to about $0.264$ for Exam 2 with EQ and runtime data. The study contributes a direct comparison of error measures in CS1, a replication in a different teaching context, and publicly available data and tooling that educators can use to diagnose and support student learning through error-aware feedback and interventions. The findings underscore the potential of IDE-based error analytics to complement outcome-based assessments and to guide instructional design and IDE enhancements for novice programmers.

Abstract

Programming courses can be challenging for first-year university students, especially for those without prior coding experience. Students initially struggle with code syntax, but as more advanced topics are introduced across a semester, the difficulty in learning to program shifts to learning computational thinking (e.g., debugging strategies). This study examined the relationships between students' rate of programming errors and their grades on two exams. Using an online integrated development environment, data were collected from 280 students in a Java programming course. The course had two parts. The first focused on introductory procedural programming and culminated with exam 1, while the second part covered more complex topics and object-oriented programming and ended with exam 2. To measure students' programming abilities, 51,095 code snapshots were collected from students while they completed assignments that were autograded based on unit tests. Compiler and runtime errors were extracted from the snapshots, and three measures (Error Count, Error Quotient, and Repeated Error Density) were explored to identify the measure that best explains variability in exam grades. Models utilizing Error Quotient outperformed the models using the other two measures in terms of both the explained variability in grades and the Bayesian Information Criterion. Compiler errors were significant predictors of exam 1 grades but not exam 2 grades; only runtime errors significantly predicted exam 2 grades. The findings indicate that leveraging Error Quotient with multiple error types (compiler and runtime) may be a better measure of students' introductory programming abilities, though it still does not explain most of the observed variability.
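
The abstract names the error measures without defining them, so the following is only an illustrative sketch of how two of them might be computed from one student's ordered snapshots. It is not the authors' implementation. The function names, the representation of a snapshot as an error-type string (or `None` for no error), and the weights 8 and 3 are all assumptions; the weights follow a simplified Error Quotient variant that appears in later literature, not necessarily the one used in this study. Repeated Error Density is omitted.

```python
from typing import Optional, Sequence

# Hypothetical representation: each snapshot is the error type it produced,
# or None if it compiled and ran without error.
Snapshot = Optional[str]


def error_count(snapshots: Sequence[Snapshot]) -> int:
    """Count the snapshots that ended in an error."""
    return sum(1 for err in snapshots if err is not None)


def error_quotient(
    snapshots: Sequence[Snapshot],
    both_error_weight: int = 8,  # assumed weight, not taken from the paper
    same_type_weight: int = 3,   # assumed weight, not taken from the paper
) -> float:
    """Simplified Error Quotient: average penalty over consecutive pairs.

    A pair is penalized when both snapshots end in an error, and penalized
    further when the two errors are of the same type. Each pair's score is
    normalized to [0, 1] before averaging.
    """
    pairs = list(zip(snapshots, snapshots[1:]))
    if not pairs:
        return 0.0
    max_score = both_error_weight + same_type_weight
    total = 0.0
    for first, second in pairs:
        score = 0
        if first is not None and second is not None:
            score += both_error_weight
            if first == second:
                score += same_type_weight
        total += score / max_score
    return total / len(pairs)


# Made-up example session with four snapshots.
session = ["missing_semicolon", "missing_semicolon", None, "unknown_symbol"]
print(error_count(session))
print(error_quotient(session))
```

Under these assumptions, a student who repeats the same error across consecutive snapshots scores close to 1, while a student who fixes each error before the next snapshot scores close to 0. This is the sense in which such a measure captures struggle rather than raw error volume.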

Paper Structure (40 sections, 1 figure, 3 tables)