HowkGPT: Investigating the Detection of ChatGPT-generated University Student Homework through Context-Aware Perplexity Analysis

Christoforos Vasilatos, Manaar Alam, Talal Rahwan, Yasir Zaki, Michail Maniatakos

TL;DR

The paper tackles the challenge of detecting AI-generated university homework by introducing HowkGPT, a perplexity-based detector that leverages metadata-driven, category-specific thresholds. It relies on a pretrained GPT-2 model to compute perplexities on a dataset of student and ChatGPT responses, augmented by knowledge and cognitive process categorizations. ROC-AUC and F1 metrics guide optimal thresholds, with experiments showing improved accuracy when applying category-based thresholds and dataset flavors that filter noise. The work also provides an offline-and-online workflow and a public web application, contributing a practical framework to uphold academic integrity amid evolving LLM capabilities.
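
For reference, the perplexity that such a detector thresholds is conventionally defined for an autoregressive language model as shown below. This is the standard textbook form, not a reproduction of the paper's own equation; the symbols $X$, $t$, and $p_\theta$ are notation assumed here for illustration.

```latex
% Standard perplexity of a token sequence X = (x_1, ..., x_t)
% under a language model with parameters \theta (assumed notation).
\[
  \mathrm{PPL}(X) \;=\; \exp\!\left( -\frac{1}{t} \sum_{i=1}^{t} \log p_\theta\!\left(x_i \mid x_{<i}\right) \right)
\]
```

Lower values mean the scoring model found the text more predictable, which is why a threshold on this quantity can serve as a signal of machine authorship.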

Abstract

As the use of Large Language Models (LLMs) in text generation tasks proliferates, concerns arise over their potential to compromise academic integrity. The education sector currently tussles with distinguishing student-authored homework assignments from AI-generated ones. This paper addresses the challenge by introducing HowkGPT, designed to identify homework assignments generated by AI. HowkGPT is built upon a dataset of academic assignments and accompanying metadata [17] and employs a pretrained LLM to compute perplexity scores for student-authored and ChatGPT-generated responses. These scores then assist in establishing a threshold for discerning the origin of a submitted assignment. Given the specificity and contextual nature of academic work, HowkGPT further refines its analysis by defining category-specific thresholds derived from the metadata, enhancing the precision of the detection. This study emphasizes the critical need for effective strategies to uphold academic integrity amidst the growing influence of LLMs and provides an approach to ensuring fair and accurate grading in educational institutions.
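
The thresholding idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: per-token log-probabilities are assumed to come from a pretrained model such as GPT-2, a response is assumed to be flagged as AI-generated when its perplexity falls at or below the threshold, and the function names and the category labels `"factual"` and `"creative"` are hypothetical.

```python
import math
from collections import defaultdict


def perplexity(token_logprobs):
    """Exponential of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def f1_at(threshold, scores, labels):
    """F1 for the assumed rule: perplexity <= threshold means AI-generated (label 1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(scores, labels):
    """Return the observed score that maximizes F1 (ties go to the lowest score)."""
    return max(sorted(set(scores)), key=lambda t: f1_at(t, scores, labels))


def fit_category_thresholds(records):
    """Fit one threshold per category from (category, perplexity, label) records."""
    by_cat = defaultdict(lambda: ([], []))
    for cat, score, label in records:
        by_cat[cat][0].append(score)
        by_cat[cat][1].append(label)
    return {cat: best_threshold(s, y) for cat, (s, y) in by_cat.items()}


def is_ai_generated(score, category, thresholds, global_threshold):
    """Use the category threshold when one exists, else the global fallback."""
    return score <= thresholds.get(category, global_threshold)


# Toy data with hypothetical categories; label 1 = ChatGPT, 0 = student.
records = [
    ("factual", 5.0, 1), ("factual", 6.0, 1),
    ("factual", 20.0, 0), ("factual", 25.0, 0),
    ("creative", 30.0, 1), ("creative", 35.0, 1),
    ("creative", 60.0, 0), ("creative", 80.0, 0),
]
thresholds = fit_category_thresholds(records)
```

In this toy data a perplexity of 25 would be flagged in the hypothetical "creative" category but not in "factual", which illustrates why a single global threshold can misclassify responses whose typical perplexity depends on the question type.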

Paper Structure (17 sections, 1 equation, 8 figures, 5 tables, 1 algorithm)

Figures (8)

  • Figure 1: An illustrative example of perplexity scores computed using HowkGPT for different options as the next word given a specific context (highlighted in gray), where 'responsibly' is the default choice of ChatGPT.
  • Figure 2: The categorization defined by the professors providing the questions.
  • Figure 3: Offline and Live process flows of the application.
  • Figure 4: Distributions of perplexity values for the original dataset and after applying the different filtering strategies listed in the paper's dataset-flavors table.
  • Figure 5: ROC curves over different perplexity thresholds (THR). Grayed-out regions show the AUC for the optimal threshold. Each subfigure is generated from one dataset flavor.
  • ...and 3 more figures