Counting the Trees in the Forest: Evaluating Prompt Segmentation for Classifying Code Comprehension Level
David H. Smith, Max Fowler, Paul Denny, Craig Zilles
TL;DR
This work tackles the challenge of evaluating code comprehension in Explain-in-Plain-English tasks by distinguishing high-level, relational descriptions from low-level, multi-structural ones. It proposes a segmentation-based method where a large language model segments a student response and the target code, mapping segments to code lines to classify the response, with performance measured against human labels and enhanced by post-processing. Evaluations on data from a large introductory course show substantial agreement with human judgments, improved further by removing segments that solely describe function definitions, and supported by metrics such as Cohen's $\kappa$ and F1. The approach is released as an open-source Python package (eiplgrader) and discussed as a formative feedback mechanism, offering flexible, question-level tuning and visual mappings to help students bridge natural language explanations and implementation.
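The classification step summarized above can be illustrated with a minimal sketch. It assumes the LLM segmentation has already produced segments mapped to code lines; the `Segment` type, the `classify` function, the `def_lines` parameter, and the threshold `max_relational_segments` are hypothetical names chosen for illustration and are not the eiplgrader API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One piece of a student response (hypothetical structure)."""
    text: str              # phrase taken from the student's response
    code_lines: list[int]  # 1-indexed lines of the target code it maps to


def classify(segments, def_lines=frozenset({1}), max_relational_segments=1):
    """Label a response from its segment count.

    Assumptions (not taken from the paper): the function definition sits on
    the lines in `def_lines`, and a response with at most
    `max_relational_segments` remaining segments counts as relational.
    """
    kept = []
    for seg in segments:
        # Post-processing: drop segments that describe only the definition.
        only_definition = bool(seg.code_lines) and set(seg.code_lines) <= set(def_lines)
        if not only_definition:
            kept.append(seg)
    if len(kept) <= max_relational_segments:
        return "relational", kept
    return "multi-structural", kept


# Toy responses for a six-line function that sums the even numbers in a list.
high_level = [
    Segment("this function", [1]),
    Segment("returns the sum of the even numbers in a list", [2, 3, 4, 5, 6]),
]
line_by_line = [
    Segment("defines a function", [1]),
    Segment("sets total to zero", [2]),
    Segment("loops over the list", [3]),
    Segment("checks if the number is even", [4]),
    Segment("adds it to total", [5]),
    Segment("returns total", [6]),
]

high_label, _ = classify(high_level)
low_label, _ = classify(line_by_line)
print(high_label, low_label)
```

Exposing the threshold as a parameter reflects the question-level tuning mentioned above: a longer function might reasonably tolerate more segments before a response is treated as line-by-line.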
Abstract
Reading and understanding code are fundamental skills for novice programmers, and especially important with the growing prevalence of AI-generated code and the need to evaluate its accuracy and reliability. ``Explain in Plain English'' questions are a widely used approach for assessing code comprehension, but providing automated feedback, particularly on comprehension levels, is a challenging task. This paper introduces a novel method for automatically assessing the comprehension level of responses to ``Explain in Plain English'' questions. Central to this is the ability to distinguish between two response types: multi-structural, where students describe the code line-by-line, and relational, where they explain the code's overall purpose. Using a Large Language Model (LLM) to segment both the student's description and the code, we aim to determine whether the student describes each line individually (many segments) or the code as a whole (fewer segments). We evaluate this approach's effectiveness by comparing segmentation results with human classifications, achieving substantial agreement. We conclude by discussing how this approach, which we release as an open-source Python package, could be used as a formative feedback mechanism.
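Agreement between automated and human classifications is typically summarized with Cohen's $\kappa$ and F1, the metrics named in the summary above. The following is a minimal sketch of both computations in plain Python; the function names and the ten label pairs are made up for illustration and are not data or results from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


def f1_score(truth, predicted, positive):
    """F1 for one class, treating `truth` as the reference labels."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, predicted))
    fp = sum(t != positive and p == positive for t, p in zip(truth, predicted))
    fn = sum(t == positive and p != positive for t, p in zip(truth, predicted))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels only: R = relational, M = multi-structural.
human = ["R", "R", "M", "M", "R", "M", "M", "R", "M", "M"]
auto = ["R", "R", "M", "M", "M", "M", "M", "R", "M", "R"]

kappa = cohen_kappa(human, auto)
f1_relational = f1_score(human, auto, positive="R")
print(round(kappa, 3), f1_relational)
```

Unlike raw accuracy, $\kappa$ discounts the agreement expected by chance, which matters when one response type is much more common than the other.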
