Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course

Cheng-Han Chiang, Wei-Chih Chen, Chun-Yi Kuan, Chienchou Yang, Hung-yi Lee

TL;DR

This study evaluates the deployment of GPT-4-based LLM TAs to automatically score assignments in a large course with 1,028 students. It provides real-world insights into student acceptability, the reliability of LLM-based grading, and vulnerabilities to prompt hacking, along with practical guidelines for implementation and future research directions. The findings show that students broadly accept LLM TAs when they have free access to them, but also reveal significant issues, including failures to adhere to the required format, scoring misalignments, and manipulation risks. The work offers actionable recommendations for educators and highlights key areas for NLP research: improving instruction-following and defending against prompt-based attacks in large-scale educational settings.

Abstract

Using large language models (LLMs) for automatic evaluation has become an important evaluation method in NLP research. However, it is unclear whether these LLM-based evaluators can be applied in real-world classrooms to assess student assignments. This empirical report shares how we use GPT-4 as an automatic assignment evaluator in a university course with 1,028 students. Based on student responses, we find that LLM-based assignment evaluators are generally acceptable to students when they have free access to the evaluators. However, students also noted that the LLM sometimes fails to adhere to the evaluation instructions. Additionally, we observe that students can easily manipulate the LLM-based evaluator into outputting specific strings, allowing them to achieve high scores without meeting the assignment rubric. Based on student feedback and our experience, we provide several recommendations for integrating LLM-based evaluators into future classrooms. Our observations also highlight potential directions for improving LLM-based evaluators, including strengthening their instruction-following ability and reducing their vulnerability to prompt hacking.

Paper Structure

This paper contains 60 sections, 6 figures, 8 tables.

Figures (6)

  • Figure 1: How we use LLM TAs in our course: (1) The teaching team first creates an LLM TA by specifying the evaluation prompts. Next, (2) the student submits an assignment, and (3) the LLM TA outputs an evaluation result. Last, (4) the student submits this result to the teaching team, and the teaching team extracts a score from the evaluation result as the assignment's score.
  • Figure 2: Whether students can accept using LLM TAs before this course on a scale of 1 to 5, with 1 being the most unacceptable and 5 being the most acceptable. The results are broken down by whether students have an ML background.
  • Figure 3: Whether students can accept using LLM TAs on a scale of 1 to 5 under different scenarios, with 1 being the most unacceptable and 5 being the most acceptable. The scenarios are the four options in the section "Possible Options of Using LLM TAs" and an additional one (*), corresponding to option (3) with the constraint that students cannot dispute the teacher-assigned score. Left: Students from the EECS department. Right: Students from the Liberal Arts department.
  • Figure 4: Whether students can accept using LLM TAs before this course on a scale of 1 to 5, with 1 being the most unacceptable and 5 being the most acceptable. The results are broken down by department (EECS and Liberal Arts).
  • Figure 5: Whether students can accept using LLM TAs on a scale of 1 to 5 under different scenarios, with 1 being the most unacceptable and 5 being the most acceptable. The scenarios are the four options in the section "Possible Options of Using LLM TAs" and an additional one (*), corresponding to option (3) with the constraint that students cannot dispute the teacher-assigned score. Left: Students with an ML background. Right: Students without an ML background.
  • ...and 1 more figure