Feature Extraction and Analysis for GPT-Generated Text

A. Selvioğlu, V. Adanova, M. Atagoziev

TL;DR

This study addresses the problem of distinguishing human-written from GPT-generated academic text by extracting 11 interpretable features spanning statistical, morphological, semantic, and lexical dimensions. It combines a Random Forest classifier with SHAP explanations and a paragraph-level BERT classifier to reveal both global and region-specific cues. GPT outputs tend to have longer sentences and larger paragraphs, with distinct word-length and prefix usage patterns, while semantic similarity to titles and paragraph-to-title alignment are strong indicators. Reported classification accuracy is high for both the feature-based classifier (Abstracts 98%, Introductions 100%, Combined 93%) and the paragraph-level classifier (≈98%), and the results emphasize the value of interpretable cues and human oversight in detection. Overall, the work provides a practical, explainable framework for AI-content detection in academic writing, highlighting robust cues and limitations for real-world deployment.

Abstract

With the rise of advanced natural language models like GPT, distinguishing between human-written and GPT-generated text has become increasingly challenging and crucial across various domains, including academia. The long-standing issue of plagiarism has grown more pressing, now compounded by concerns about the authenticity of information, as it is not always clear whether the presented facts are genuine or fabricated. In this paper, we present a comprehensive study of feature extraction and analysis for differentiating between human-written and GPT-generated text. By applying machine learning classifiers to these extracted features, we evaluate the significance of each feature in detection. Our results demonstrate that human and GPT-generated texts exhibit distinct writing styles, which can be effectively captured by our features. Given sufficiently long text, the two can be differentiated with high accuracy.

Paper Structure

This paper contains 18 sections, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Average sentence length results. (left) Density plot of sentence length for GPT and human abstracts and introductions. (right) Branch-based box plots for sentence length.
  • Figure 2: MTLD results. (left) Density plot of MTLD for GPT and human abstracts and introductions. (right) Branch-based box plots for MTLD.
  • Figure 3: Density plot of the average number of punctuation marks per sentence for GPT and human abstracts and introductions.
  • Figure 4: Density plot of the average number of sentences per paragraph for GPT and human abstracts and introductions.
  • Figure 5: Entropy results. (left) Density plot of entropy for GPT and human abstracts and introductions. (right) Branch-based box plots for entropy.
  • ...and 10 more figures
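
Two of the features named in the figure list, average sentence length and entropy, can be illustrated with a minimal sketch. This summary does not give the paper's exact definitions, so everything below is an assumption: the regex-based sentence splitter and tokenizer, the choice of word-level Shannon entropy in bits, and the function names `split_sentences`, `tokenize`, `avg_sentence_length`, and `word_entropy` are all hypothetical and may differ from what the authors used.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Assumption: a naive splitter that breaks after '.', '!' or '?'
    # followed by whitespace. The paper may use a proper NLP tokenizer.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(text):
    # Assumption: lowercased runs of letters, digits, and apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def avg_sentence_length(text):
    # Mean number of word tokens per sentence; 0.0 for empty input.
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def word_entropy(text):
    # Shannon entropy (in bits) of the word frequency distribution.
    # Assumption: word-level entropy; the paper's definition is not
    # stated in this summary and could be character-level instead.
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under these assumptions, each abstract or introduction would be reduced to one value per feature, and those values could then be compared between the GPT and human groups or passed to a classifier.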