Counting the Trees in the Forest: Evaluating Prompt Segmentation for Classifying Code Comprehension Level

David H. Smith, Max Fowler, Paul Denny, Craig Zilles

TL;DR

This work tackles the challenge of evaluating code comprehension in Explain-in-Plain-English tasks by distinguishing high-level, relational descriptions from low-level, multi-structural ones. It proposes a segmentation-based method where a large language model segments a student response and the target code, mapping segments to code lines to classify the response, with performance measured against human labels and enhanced by post-processing. Evaluations on data from a large introductory course show substantial agreement with human judgments, improved further by removing segments that solely describe function definitions, and supported by metrics such as Cohen's $\kappa$ and F1. The approach is released as an open-source Python package (eiplgrader) and discussed as a formative feedback mechanism, offering flexible, question-level tuning and visual mappings to help students bridge natural language explanations and implementation.

Abstract

Reading and understanding code are fundamental skills for novice programmers, and especially important with the growing prevalence of AI-generated code and the need to evaluate its accuracy and reliability. "Explain in Plain English" questions are a widely used approach for assessing code comprehension, but providing automated feedback, particularly on comprehension levels, is a challenging task. This paper introduces a novel method for automatically assessing the comprehension level of responses to "Explain in Plain English" questions. Central to this is the ability to distinguish between two response types: multi-structural, where students describe the code line-by-line, and relational, where they explain the code's overall purpose. Using a Large Language Model (LLM) to segment both the student's description and the code, we aim to determine whether the student describes each line individually (many segments) or the code as a whole (fewer segments). We evaluate this approach's effectiveness by comparing segmentation results with human classifications, achieving substantial agreement. We conclude with how this approach, which we release as an open-source Python package, could be used as a formative feedback mechanism.
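
As a rough illustration of the idea in the abstract, the sketch below classifies responses by segment count and measures agreement with human labels using Cohen's kappa. It is a minimal sketch under stated assumptions, not the eiplgrader implementation: the segment counts are taken as already produced by an LLM segmentation step, and the threshold value, the function names (`classify`, `cohen_kappa`), and the sample data are all hypothetical.

```python
from collections import Counter


def classify(segment_count: int, threshold: int = 2) -> str:
    """Label a response from its segment count.

    Assumption: counts above `threshold` indicate a line-by-line
    (multi-structural) description; the value 2 is illustrative only.
    """
    return "multi-structural" if segment_count > threshold else "relational"


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two equal-length label sequences."""
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical data: segment counts per response and matching human labels.
segment_counts = [1, 5, 2, 4, 1, 6]
human_labels = [
    "relational", "multi-structural", "relational",
    "multi-structural", "relational", "relational",
]

predicted = [classify(count) for count in segment_counts]
kappa = cohen_kappa(predicted, human_labels)
print(f"kappa = {kappa:.3f}")
```

Repeating the calculation over several candidate thresholds would show which cut-off agrees best with human raters for a given question.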

Paper Structure

This paper contains 18 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: The Components of the Prompt Used for the Segmentation Approach Used in this Study.
  • Figure 2: Question interface for A-Q3: Index of last zero.
  • Figure 3: The mean and standard deviation of the number of segments generated for multi-structural and relational responses.
  • Figure 4: Performance of the segmentation approach for classifying student responses. Responses whose number of segments exceeds a given threshold are classified as multi-structural, and those at or below it are classified as relational.
  • Figure 5: Potential student-facing feedback mechanisms for use with prompt segmentation classification.