Automated Assessment of Students' Code Comprehension using LLMs

Priti Oli, Rabin Banjade, Jeevan Chapagain, Vasile Rus

TL;DR

The paper addresses automatic assessment of students' line-by-line code explanations by comparing LLM-based prompting with encoder-based semantic textual similarity (STS) models on the CodeCorpus dataset. It evaluates multiple encoders (SBERT variants, CodeBERT, BERTScore, USE) and prompting strategies (0–1 and 1–5 scoring scales, few-shot, chain-of-thought) across several LLMs, including GPT-3.5, GPT-4, GPT-4 Turbo, and Llama 2. Findings show that fine-tuned encoders achieve strong correlations, while GPT-4 with chain-of-thought prompting approaches encoder performance and can outperform baseline prompting, indicating that LLM-based assessment is a viable, scalable alternative for feedback in CS education. The work highlights that numerical reasoning remains a challenge for LLMs and suggests future work to improve robustness by integrating encoder signals and refining prompting strategies for more reliable feedback in programming education.

Abstract

Assessing students' answers, and in particular natural language answers, is a crucial challenge in the field of education. Advances in machine learning, including transformer-based models such as Large Language Models (LLMs), have led to significant progress in various natural language tasks. Nevertheless, amidst the growing trend of evaluating LLMs across diverse tasks, evaluating LLMs in the realm of automated answer assessment has not received much attention. To address this gap, we explore the potential of using LLMs for automated assessment of students' short, open-ended answers. In particular, we use LLMs to compare students' explanations with expert explanations in the context of line-by-line explanations of computer programs. For comparison purposes, we assess both LLMs and encoder-based Semantic Textual Similarity (STS) models on the task of judging the correctness of students' explanations of computer code. Our findings indicate that LLMs, when prompted in few-shot and chain-of-thought settings, perform comparably to fine-tuned encoder-based models in evaluating students' short answers in the programming domain.
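
To make the STS-style setup concrete, the following is a minimal sketch of scoring a student explanation against an expert explanation and correlating such scores with human ratings. It is not the authors' pipeline: the bag-of-words vector is a toy stand-in for a sentence encoder such as SBERT, and the function names (`bag_of_words`, `score_explanation`, `pearson`) and example sentences are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Toy stand-in for an encoder (assumption): lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty explanation shares nothing with the reference
    return dot / (norm_a * norm_b)


def score_explanation(student, expert):
    """Similarity in [0, 1] between a student and an expert explanation."""
    return cosine_similarity(bag_of_words(student), bag_of_words(expert))


def pearson(xs, ys):
    """Pearson correlation, e.g. between model scores and human ratings."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical expert and student explanations of one line of code.
    expert = "declares an integer variable count and sets it to zero"
    students = [
        "declares an integer variable count and sets it to zero",
        "creates a variable count",
        "prints the result to the screen",
    ]
    for s in students:
        print(round(score_explanation(s, expert), 3), s)
```

With a real encoder, `bag_of_words` would be replaced by an embedding call and the resulting scores compared against human-annotated correctness ratings using `pearson` or a rank correlation.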

Paper Structure (16 sections, 3 tables)