LLMs in the Classroom: Outcomes and Perceptions of Questions Written with the Aid of AI

Gavin Witsken, Igor Crk, Eren Gultepe

TL;DR

The study examines whether multiple-choice questions (MCQs) authored with the aid of an LLM and rigorously validated by an instructor perform similarly to human-authored items in an OS course, and whether students can reliably detect authorship. In a controlled classroom setting, 32 questions (24 AI-assisted counterparts) were deployed to 25 students, with 714 data points collected. Results show significantly lower scores on AI-authored items and no significant difference in students' ability to perceive authorship, while AI questions were more closely aligned with the course textbook. The authors implement a two-pass validation workflow, SBERT-based similarity measures, and multiple statistical analyses (Mann-Whitney U, logistic regression (LR), conditional inference tree (CIT), cross-validation, clustering) to triangulate effects on performance and perception. Findings suggest AI-assisted item construction is feasible but requires careful validation and awareness of its potential impact on fairness and learning outcomes, underscoring the need for replication and discipline-wide examination of AI-enabled assessment design.

Abstract

We randomly deploy questions constructed with and without use of the LLM tool and gauge the ability of the students to correctly answer, as well as their ability to correctly perceive the difference between human-authored and LLM-authored questions. In determining whether the questions written with the aid of ChatGPT were consistent with the instructor's questions and source text, we computed representative vectors of both the human and ChatGPT questions using SBERT and compared cosine similarity to the course textbook. A non-significant Mann-Whitney U test (z = 1.018, p = .309) suggests that students were unable to perceive whether questions were written with or without the aid of ChatGPT. However, student scores on LLM-authored questions were almost 9% lower (z = 2.702, p < .01). This result may indicate that either the AI questions were more difficult or that the students were more familiar with the instructor's style of questions. Overall, the study suggests that while there is potential for using LLM tools to aid in the construction of assessments, care must be taken to ensure that the questions are fair, well-composed, and relevant to the course material.
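
The textbook comparison described above rests on cosine similarity between embedding vectors. The following is a minimal sketch of that one step, not the authors' code: the three-dimensional vectors are hypothetical stand-ins for SBERT embeddings (which have several hundred dimensions), and the names `cosine_similarity`, `textbook_vec`, and `question_vec` are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors standing in for SBERT embeddings (assumption).
textbook_vec = [1.0, 0.0, 0.0]
question_vec = [1.0, 1.0, 0.0]

similarity = cosine_similarity(question_vec, textbook_vec)
print(f"cosine similarity to textbook: {similarity:.3f}")
```

In practice, each question and the textbook would first be encoded by an SBERT model; the score is then computed once per question and compared across the human-authored and LLM-authored groups.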

Paper Structure

This paper contains 17 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Steps from question generation to data collection. We attempt to match each instructor-authored multiple-choice question (A) to an LLM-generated question (B). If (B) passes both human validation steps, it is added to the question bank along with the paired instructor-authored question. Each student's assessment includes either (A) or (B), chosen at random. Data collected include the student's score, perception (whether the question appears to be human- or LLM-authored), and the cosine similarity between the question and the course textbook.
  • Figure 2: Histograms showing the effect of human-written and AI-generated questions on student performance. A) A Mann-Whitney U (MWU) test ($z = 1.02$, $p = .31$) showed no difference in students' ability to distinguish human (M = 0.62, SD = 0.28) from AI (M = 0.60, SD = 0.27) questions. B) An MWU test ($z = 2.70$, $p < .01$) showed that students scored higher on human-written questions (M = 0.83, SD = 0.37) than on AI-generated ones (M = 0.74, SD = 0.43). C) An MWU test ($z = -5.89$, $p < .001$) showed that human-written questions (M = 0.14, SD = 0.11) were significantly less similar to the course textbook than AI-generated ones (M = 0.21, SD = 0.15).
  • Figure 3: If a student's Perception value is 40% and the true question authorship is human, then the AD is 0.6; if the true authorship is LLM, then the AD is 0.4.
  • Figure 4: A) The tree shows the interaction between cosine similarity and score in ascertaining whether a question is authored by a human or AI. Due to the negative correlation ($r_{s} = -.13$, $p < .001$) between cosine similarity and human authorship, we can expect that when cosine similarity is $\leq 0.242$, more human-authored questions will be represented (light-gray portion of leaf nodes), and otherwise we expect to see more AI-authored questions (dark-gray portion of leaf nodes). Based on this separation of MCQs, we can also compute that when cosine similarity is $\leq 0.242$, students scored better on human-written questions (mean error rate of 0.23) than on AI-generated questions (mean error rate of 0.303). B) CIT had a higher AUC (0.76) than LR (0.65) for determining question authorship, which could be due to better modelling of the interaction between cosine similarity and score.
  • Figure 5: Hierarchical clustering using each student's average absolute difference (AD) values strongly suggests that there are three distinct groups of students (silhouette = 0.570). Students in cluster two were best at differentiating authorship among MCQs.
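
The absolute difference (AD) used in Figures 3 and 5 can be illustrated with a short sketch. It assumes, as the Figure 3 example implies, that Perception is the student's confidence that a question is human-authored, with human coded as 1 and LLM as 0; this coding and the function names `absolute_difference` and `mean_ad` are assumptions made for illustration, not definitions taken from the paper.

```python
def absolute_difference(perception, human_authored):
    """AD between a perception value in [0, 1] and the true authorship.

    Assumed coding: human-authored = 1.0, LLM-authored = 0.0.
    """
    if not 0.0 <= perception <= 1.0:
        raise ValueError("perception must lie in [0, 1]")
    truth = 1.0 if human_authored else 0.0
    return abs(truth - perception)


def mean_ad(responses):
    """Average AD over one student's (perception, human_authored) pairs."""
    if not responses:
        raise ValueError("at least one response is required")
    total = sum(absolute_difference(p, h) for p, h in responses)
    return total / len(responses)


# The Figure 3 example: a Perception value of 40%.
ad_if_human = absolute_difference(0.40, human_authored=True)
ad_if_llm = absolute_difference(0.40, human_authored=False)
print(round(ad_if_human, 2), round(ad_if_llm, 2))
```

Under this coding a lower AD means the student's perception was closer to the true authorship, so a per-student average such as `mean_ad` is the kind of value that could feed the clustering in Figure 5.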