A Study on the Vulnerability of Test Questions against ChatGPT-based Cheating

Shanker Ram; Chen Qian

TL;DR

This study assesses the vulnerability of multiple-choice test questions to ChatGPT-based cheating, focusing on medical-domain items from the MedMCQA dataset. It demonstrates that ChatGPT achieves around 60%–63% accuracy on these questions, with susceptibility strongly influenced by topic and answer-choice structure rather than basic lexical features. The authors develop a PyTorch-based NLP predictor that classifies questions as GPT-susceptible with about 60% accuracy (rising above 70% at high confidence), and show its potential to help educators filter vulnerable items. Additional validation on a smaller physics dataset and analysis across GPT-4 vs GPT-3.5-turbo suggest practical implications for test design and emphasize the need for cross-domain validation with other chatbots.

Abstract

ChatGPT is a chatbot that answers text prompts fairly accurately and performs well even on postgraduate-level questions. Many educators have found their take-home or remote tests and exams vulnerable to ChatGPT-based cheating, because students can directly submit answers produced by tools like ChatGPT. In this paper, we address two questions: how well can ChatGPT answer test questions, and how can we detect whether a given test question is likely to be answered correctly by ChatGPT? We generated ChatGPT's responses to the MedMCQA dataset, which contains over 10,000 medical school entrance exam questions. We analyzed the responses and identified types of questions that ChatGPT answers less accurately than others. In addition, we built a basic natural language processing model that singles out the questions most vulnerable to ChatGPT in a collection of questions or a sample exam. Test-makers can use our tool to avoid ChatGPT-vulnerable test questions.
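To make the filtering idea concrete, the following is a minimal sketch of how a test-maker might use susceptibility scores to flag questions and to measure accuracy on high-confidence predictions only. It is not the authors' model: the `ScoredQuestion` type, the function names, the thresholds, and the probabilities are all hypothetical, and confidence is assumed here to mean max(p, 1 - p) for a predicted susceptibility probability p.

```python
from dataclasses import dataclass


@dataclass
class ScoredQuestion:
    text: str
    p_susceptible: float  # hypothetical predictor output in [0, 1]


def confidence(p):
    """Assumed definition: distance of the prediction from a coin flip."""
    return max(p, 1.0 - p)


def flag_vulnerable(questions, threshold=0.5, min_confidence=0.0):
    """Return questions predicted GPT-susceptible with enough confidence."""
    return [
        q for q in questions
        if q.p_susceptible >= threshold
        and confidence(q.p_susceptible) >= min_confidence
    ]


def accuracy_at_confidence(probs, labels, min_confidence):
    """Accuracy over predictions whose confidence meets the cutoff.

    Returns None when no prediction qualifies.
    """
    kept = [(p, y) for p, y in zip(probs, labels)
            if confidence(p) >= min_confidence]
    if not kept:
        return None
    correct = sum((p >= 0.5) == bool(y) for p, y in kept)
    return correct / len(kept)


# Illustrative (made-up) scores and ground-truth labels.
probs = [0.9, 0.6, 0.4, 0.1]
labels = [1, 0, 0, 0]  # 1 = ChatGPT answered the question correctly
overall = accuracy_at_confidence(probs, labels, 0.0)
high_conf = accuracy_at_confidence(probs, labels, 0.8)
```

Raising `min_confidence` trades coverage for reliability: fewer questions receive a verdict, but the verdicts that remain tend to be more trustworthy, which is the pattern the TL;DR reports for the high-confidence predictions.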

Paper Structure (20 sections, 6 figures, 11 tables)

Introduction
Dataset
Data Collection
Apparent Trends in Data
Structuring and Complexity of Questions has no Effect on ChatGPT
Using Multi-Select Problems has Little to no Effect on ChatGPT’s Performance
Adding Extra Option Choices has no Effect on ChatGPT’s Performance
ChatGPT Struggles More with Questions with the Word "except" in Them
A Major Indicator of ChatGPT’s Success at Answering a Question Correctly is the Topic the Question is Based On
ChatGPT Drastically Overpredicts the Options “All of the above” and “None of the above”
Natural Language Processing Model
Preprocessing of Data
Dataset Splits
Neural Network Architecture
Training
...and 5 more sections

Figures (6)

Figure 1: Categorical scatterplot with slight random jitter to avoid overplotting
Figure 2: Categorical scatterplots with slight random jitter to avoid overplotting
Figure 3: Example of multi-select problem, the correct answer is A and D
Figure 4: Example of problem with "except" in it
Figure 5: Model Architecture
...and 1 more figure
