Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments

Alona Strugatski, Giora Alexandron

TL;DR

This paper addresses AI-generated cheating in MCQ assessments by applying Item Response Theory and nonparametric Person-Fit Statistics to distinguish GenAI from human responses. Using two real-world instruments and three GenAI models, the study shows that PFS measures can reliably separate human and AI patterns and reveal inter-model differences. However, detectability declines as AI usage becomes more prevalent (pollution level rises), with the $ZU3$ statistic being particularly sensitive to these changes. The work establishes a theory-grounded framework for GenAI cheating detection in MCQs and outlines limitations and directions for future refinement.

Abstract

Generative AI is transforming the educational landscape, raising significant concerns about cheating. Despite the widespread use of multiple-choice questions in assessments, the detection of AI cheating in MCQ-based tests has been almost unexplored, in contrast to the focus on detecting AI-cheating on text-rich student outputs. In this paper, we propose a method based on the application of Item Response Theory to address this gap. Our approach operates on the assumption that artificial and human intelligence exhibit different response patterns, with AI cheating manifesting as deviations from the expected patterns of human responses. These deviations are modeled using Person-Fit Statistics. We demonstrate that this method effectively highlights the differences between human responses and those generated by premium versions of leading chatbots (ChatGPT, Claude, and Gemini), but that it is also sensitive to the amount of AI cheating in the data. Furthermore, we show that the chatbots differ in their reasoning profiles. Our work provides both a theoretical foundation and empirical evidence for the application of IRT to identify AI cheating in MCQ-based assessments.
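To make the idea of "deviations from the expected patterns of human responses" concrete, here is a minimal sketch of one classic nonparametric person-fit statistic, the count of Guttman errors. It is an illustration under stated assumptions, not the authors' implementation: the function name `guttman_errors` and the toy response matrix are hypothetical, and reading the $G$ of Figure 1 as the Guttman error count is an assumption, since this excerpt does not define it. A Guttman error is a pair of items where a respondent misses the easier item but answers the harder one correctly.

```python
def guttman_errors(responses):
    """Count Guttman errors per respondent.

    responses: list of rows, one per respondent, each a list of 0/1 item scores.
    Item difficulty is estimated from the data as the proportion correct.
    """
    n_people = len(responses)
    n_items = len(responses[0])

    # Proportion of respondents answering each item correctly.
    p_correct = [sum(row[j] for row in responses) / n_people for j in range(n_items)]

    # Item indices ordered from easiest to hardest (sorted() is stable for ties).
    order = sorted(range(n_items), key=lambda j: -p_correct[j])

    counts = []
    for row in responses:
        wrong_easier = 0  # easier items missed so far
        errors = 0
        for j in order:
            if row[j] == 1:
                # A correct answer pairs with every easier item that was missed.
                errors += wrong_easier
            else:
                wrong_easier += 1
        counts.append(errors)
    return counts


# Hypothetical toy data: three consistent patterns and one aberrant pattern
# (the last respondent misses the easy items but solves the hard ones).
example = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
G = guttman_errors(example)
print(G)  # [0, 0, 0, 4]
```

In this toy case only the last respondent accumulates errors, which is the kind of aberrance a person-fit statistic is meant to flag. Note that item difficulty is estimated from the same data, so a larger share of aberrant rows shifts the item ordering itself, which is one intuition for why detection may weaken as the share of AI responses grows.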

Paper Structure

This paper contains 13 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Density plots of $G$ for human learners (blue) and conversational chatbots (red).
  • Figure 2: Results and density plots of PFS for conversational chatbots.
  • Figure 3: PFS mean with $\pm$1-STD range for ChatGPT with varying pollution rates for the chemistry and psychometric instruments.
  • Figure 4: Density plots of PFS for human learners (blue) and conversational chatbots (red), all calculated using 5% of conversational chatbot responses. These plots show the difference between the human learners and the different conversational chatbots for the two instruments used in this study: chemistry (formative) and psychometrics (summative).