LLMs in the Classroom: Outcomes and Perceptions of Questions Written with the Aid of AI
Gavin Witsken, Igor Crk, Eren Gultepe
TL;DR
The study examines whether multiple-choice questions (MCQs) authored with the aid of an LLM and rigorously validated by an instructor perform similarly to human-authored items in an operating systems (OS) course, and whether students can reliably detect authorship. In a controlled classroom setting, 32 questions (24 AI-assisted counterparts) were deployed to 25 students, with 714 data points collected. Results show significantly lower scores on AI-authored items and no significant difference in students' ability to perceive authorship, while AI questions were more aligned with the course textbook. The authors implement a two-pass validation workflow, SBERT-based similarity measures, and multiple statistical analyses (Mann-Whitney U, LR, CIT, cross-validation, clustering) to triangulate effects on performance and perception. The findings suggest that AI-assisted item construction is feasible but requires careful validation and awareness of its potential impact on fairness and learning outcomes, underscoring the need for replication and discipline-wide examination of AI-enabled assessment design.
Abstract
We randomly deployed questions constructed with and without the use of an LLM tool (ChatGPT) and gauged the students' ability to answer them correctly, as well as their ability to correctly perceive whether a question was human-authored or LLM-authored. To determine whether the questions written with the aid of ChatGPT were consistent with the instructor's questions and the source text, we computed representative vectors of both the human and ChatGPT questions using SBERT and compared their cosine similarity to the course textbook. A non-significant Mann-Whitney U test (z = 1.018, p = .309) suggests that students were unable to perceive whether questions were written with or without the aid of ChatGPT. However, student scores on LLM-authored questions were almost 9% lower (z = 2.702, p < .01). This result may indicate either that the AI questions were more difficult or that the students were more familiar with the instructor's style of questions. Overall, the study suggests that while there is potential for using LLM tools to aid in the construction of assessments, care must be taken to ensure that the questions are fair, well-composed, and relevant to the course material.
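The two quantitative steps named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code and rests on several assumptions: the "representative vector" is taken to be the mean of the per-question embeddings, the short numeric lists stand in for real SBERT embeddings, and the Mann-Whitney z statistic uses the plain normal approximation without tie or continuity correction. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def mean_vector(vectors):
    """Average equal-length embedding vectors into one representative vector
    (assumption: the abstract does not state the pooling method)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mann_whitney_z(xs, ys):
    """z statistic for the Mann-Whitney U test via the normal approximation.

    Tied values receive average ranks; no tie or continuity correction is
    applied (a simplifying assumption of this sketch).
    """
    n1, n2 = len(xs), len(ys)
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (u - mean_u) / sd_u


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Toy 3-dimensional stand-ins for SBERT embeddings (made-up values).
    human_questions = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
    textbook = [1.0, 0.0, 0.0]
    print(cosine_similarity(mean_vector(human_questions), textbook))

    # Converting the z statistics reported in the abstract to p-values.
    print(round(two_sided_p(1.018), 3))
    print(two_sided_p(2.702) < 0.01)
```

Under the normal approximation, the reported z = 1.018 corresponds to a two-sided p of about .309 and z = 2.702 to a p below .01, which agrees with the values quoted in the abstract. With real data, the toy vectors would be replaced by embeddings from an SBERT model, and a library routine with tie correction would be the safer choice for the test itself.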
