A Study on the Vulnerability of Test Questions against ChatGPT-based Cheating

Shanker Ram, Chen Qian

TL;DR

This study assesses the vulnerability of multiple-choice test questions to ChatGPT-based cheating, focusing on medical-domain items from the MedMCQA dataset. It demonstrates that ChatGPT achieves around 60%–63% accuracy on these questions, with susceptibility strongly influenced by topic and answer-choice structure rather than basic lexical features. The authors develop a PyTorch-based NLP predictor that classifies questions as GPT-susceptible with about 60% accuracy (rising above 70% at high confidence), and show its potential to help educators filter vulnerable items. Additional validation on a smaller physics dataset and analysis across GPT-4 vs GPT-3.5-turbo suggest practical implications for test design and emphasize the need for cross-domain validation with other chatbots.

Abstract

ChatGPT is a chatbot that can answer text prompts fairly accurately, even performing very well on postgraduate-level questions. Many educators have found that their take-home or remote tests and exams are vulnerable to ChatGPT-based cheating because students may directly use answers provided by tools like ChatGPT. In this paper, we address two related questions: how well can ChatGPT answer test questions, and how can we detect whether the questions on a test can be answered correctly by ChatGPT? We generated ChatGPT's responses to the MedMCQA dataset, which contains over 10,000 medical school entrance exam questions. We analyzed the responses and identified certain types of questions that ChatGPT answers less accurately than others. In addition, we created a basic natural language processing model to single out the questions most vulnerable to ChatGPT in a collection of questions or a sample exam. Test-makers can use our tool to avoid ChatGPT-vulnerable test questions.
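
As a rough illustration of how a test-maker might use such a predictor, the sketch below computes accuracy only over the questions where a classifier is confident, which is one common way to see accuracy rise as low-confidence predictions are set aside. It is not the authors' implementation: the function name, the threshold values, and the scores and labels are all hypothetical and chosen only to make the example runnable.

```python
def accuracy_at_confidence(probs, labels, threshold=0.5):
    """Accuracy and coverage over predictions at or above a confidence level.

    probs  : hypothetical predicted probability that ChatGPT answers correctly
    labels : 1 if ChatGPT actually answered correctly, else 0
    A prediction's confidence is max(p, 1 - p); less confident ones are skipped.
    Returns (accuracy, coverage), or (None, 0.0) if nothing passes the threshold.
    """
    kept = [(p, y) for p, y in zip(probs, labels) if max(p, 1 - p) >= threshold]
    if not kept:
        return None, 0.0
    # A question is predicted "susceptible" when p >= 0.5.
    correct = sum(1 for p, y in kept if (p >= 0.5) == bool(y))
    return correct / len(kept), len(kept) / len(probs)


# Illustrative values only; these are not results from the paper.
probs = [0.9, 0.8, 0.55, 0.45, 0.2, 0.6, 0.1, 0.52]
labels = [1, 1, 0, 1, 0, 0, 0, 1]

overall = accuracy_at_confidence(probs, labels, threshold=0.5)
confident = accuracy_at_confidence(probs, labels, threshold=0.75)
print("all predictions:", overall)
print("confident only: ", confident)
```

Raising the threshold trades coverage for accuracy: fewer questions receive a verdict, but the verdicts that remain are more reliable, which suits a workflow where flagged questions are reviewed by hand.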

Paper Structure

This paper contains 20 sections, 6 figures, and 11 tables.

Figures (6)

  • Figure 1: Categorical scatterplot with slight random jitter to avoid overplotting
  • Figure 2: Categorical scatterplots with slight random jitter to avoid overplotting
  • Figure 3: Example of multi-select problem, the correct answer is A and D
  • Figure 4: Example of problem with "except" in it
  • Figure 5: Model Architecture
  • ...and 1 more figures