ReDefining Code Comprehension: Function Naming as a Mechanism for Evaluating Code Comprehension

David H. Smith, Max Fowler, Paul Denny, Craig Zilles

TL;DR

The paper addresses the challenge of automatically assessing code comprehension by reframing EiPE questions as function-naming tasks, which emphasize high-level understanding over implementation details. It evaluates this approach in an introductory course using two autograding schemes (One Attempt and Robustness) and analyzes the results with a 2-parameter logistic (2PL) IRT model, $P(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}$, with ability $\theta \in [-3,3]$, item discrimination $a_i \in [0,2]$, and item difficulty $b_i \in [-3,3]$, and by mapping responses to the SOLO taxonomy. All questions show high item discrimination. Compared with One Attempt, Robustness grading makes items slightly more difficult but reduces false positives while maintaining true positives; One Attempt tends to produce easier items with somewhat lower discrimination. The work also demonstrates alignment between function-name responses and EiPE objectives, and releases the open-source eiplgrader package to support scalable adoption of this approach in computing education.
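To make the 2PL model concrete, the following is a minimal sketch of the item response function in its standard form, where the exponent is negative so that the probability of a correct response rises with ability. The function name `two_pl` and the sample parameter values are illustrative assumptions, not taken from the paper or from eiplgrader.

```python
import math


def two_pl(theta: float, a: float, b: float) -> float:
    """Standard 2PL item response function.

    theta: examinee ability
    a:     item discrimination (steepness of the curve)
    b:     item difficulty (ability at which P = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Illustrative (assumed) item parameters, not values from the paper.
a_item, b_item = 1.5, 0.0

p_low = two_pl(-1.0, a_item, b_item)   # ability below the item difficulty
p_mid = two_pl(0.0, a_item, b_item)    # ability equal to the item difficulty
p_high = two_pl(1.0, a_item, b_item)   # ability above the item difficulty
```

When ability equals the item difficulty, the probability is exactly 0.5 regardless of discrimination; a larger `a` makes the curve steeper around that point, which is why high discrimination separates stronger and weaker students more sharply.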

Abstract

"Explain in Plain English" (EiPE) questions are widely used to assess code comprehension skills but are challenging to grade automatically. Recent approaches like Code Generation Based Grading (CGBG) leverage large language models (LLMs) to generate code from student explanations and validate its equivalence to the original code using unit tests. However, this approach does not differentiate between high-level, purpose-focused responses and low-level, implementation-focused ones, limiting its effectiveness in assessing comprehension level. We propose a modified approach where students generate function names, emphasizing the function's purpose over implementation details. We evaluate this method in an introductory programming course and analyze it using Item Response Theory (IRT) to understand its effectiveness as exam items and its alignment with traditional EiPE grading standards. We also publish this work as an open source Python package for autograding EiPE questions, providing a scalable solution for adoption.
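The grading idea described above (generate code from the student's response, then check it against unit tests) can be sketched as follows. This is a hedged illustration only: `fake_llm` is a hypothetical stand-in for a real LLM call, `passes_tests` and `grade_name` are invented helper names, and none of this reflects the actual eiplgrader API. Treating the Robustness scheme as "several generations must all pass" is an assumption, not something this summary confirms.

```python
from typing import Callable, Dict, List, Tuple

# A test case is (positional arguments, expected return value).
TestCase = Tuple[tuple, object]


def passes_tests(source: str, func_name: str, tests: List[TestCase]) -> bool:
    """Run generated source and check the named function against the tests.

    NOTE: exec on generated code is unsafe; a real grader would sandbox this.
    """
    namespace: Dict[str, object] = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False


def grade_name(
    student_name: str,
    generate_code: Callable[[str], str],
    tests: List[TestCase],
    attempts: int = 1,
) -> bool:
    """Grade a student-supplied function name.

    attempts=1 mirrors a single-generation check; attempts>1 is an ASSUMED
    reading of a stricter scheme in which every generation must pass.
    """
    if not student_name.isidentifier():
        return False
    return all(
        passes_tests(generate_code(student_name), student_name, tests)
        for _ in range(attempts)
    )


def fake_llm(name: str) -> str:
    """Hypothetical code generator: only a purpose-revealing name yields working code."""
    if "sum" in name and "even" in name:
        body = "return sum(x for x in nums if x % 2 == 0)"
    else:
        body = "return None"
    return f"def {name}(nums):\n    {body}\n"


tests: List[TestCase] = [(([1, 2, 3, 4],), 6), (([],), 0)]
good = grade_name("sum_of_even_numbers", fake_llm, tests)
vague = grade_name("process_list", fake_llm, tests, attempts=3)
```

The point of the sketch is that a name conveying the function's purpose gives the generator enough information to reproduce the behavior, while a vague name does not, so the unit tests act as a proxy for high-level comprehension.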

Paper Structure

This paper contains 13 sections, 1 equation, 6 figures, 1 table.

Figures (6)

  • Figure 1: Sample interface for an Explain in Plain English (EiPE) question.
  • Figure 2: The interface of the function redefinition EiPE task.
  • Figure 3: Lengths of all responses (N=647) submitted for each question.
  • Figure 4: Correctness of all responses (N=647) for each question.
  • Figure 5: Item Statistics from the Results of Fitting 2PL IRT
  • ...and 1 more figure