Automated Text Identification Using CNN and Training Dynamics

Claudiu Creanga, Liviu Petrisor Dinu

TL;DR

The paper addresses the challenge of distinguishing human- versus AI-generated text and improving generalization to unseen writing styles. It leverages Data Maps to analyze training dynamics of a CNN-based classifier on the AuTexTification dataset, identifying three learnability regions and demonstrating that training on ambiguous examples can boost out-of-distribution performance. Empirical results show that focusing on ambiguous samples yields higher F1 scores (around 0.66) than training on the full dataset, suggesting data selection strategies that emphasize ambiguity can enhance robustness to novel domains. The work informs dataset design and training strategies for AI-generated text detection, with practical implications for mitigating misinformation and ensuring ethical AI use.

Abstract

We used Data Maps to model and characterize the AuTexTification dataset. This provides insights into the behaviour of individual samples during training across epochs (training dynamics). We characterized the samples along three dimensions: confidence, variability and correctness. This reveals three regions: easy-to-learn, ambiguous and hard-to-learn examples. We used a classic CNN architecture and found that training the model only on a subset of ambiguous examples improves its out-of-distribution generalization.
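
As a rough illustration of the three dimensions named in the abstract, the sketch below follows the usual Data Maps definitions: confidence is the mean probability assigned to the true label across epochs, variability is the standard deviation of that probability, and correctness is the fraction of epochs in which the sample was classified correctly. The function names, the 0.5 decision threshold (which assumes a binary human-versus-generated task) and the `fraction` parameter are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def data_map_coordinates(gold_probs, threshold=0.5):
    """Compute per-sample training-dynamics statistics.

    gold_probs: array of shape (n_epochs, n_samples) holding, for each
    epoch, the probability the model assigned to each sample's true label.
    threshold: assumed decision threshold for a binary classifier.
    """
    gold_probs = np.asarray(gold_probs, dtype=float)
    confidence = gold_probs.mean(axis=0)    # mean true-label probability
    variability = gold_probs.std(axis=0)    # spread across epochs
    # Fraction of epochs in which the true label would have been predicted.
    correctness = (gold_probs > threshold).mean(axis=0)
    return confidence, variability, correctness


def select_ambiguous(variability, fraction=0.33):
    """Return indices of the most variable samples (hypothetical cut-off)."""
    variability = np.asarray(variability, dtype=float)
    k = max(1, int(round(fraction * len(variability))))
    # Stable sort on negated variability: highest variability first.
    return np.argsort(-variability, kind="stable")[:k]


if __name__ == "__main__":
    # Three epochs, three samples: easy, ambiguous, hard (toy numbers).
    probs = np.array([
        [0.90, 0.20, 0.10],
        [0.95, 0.80, 0.15],
        [0.92, 0.30, 0.05],
    ])
    conf, var, corr = data_map_coordinates(probs)
    print("confidence :", conf)
    print("variability:", var)
    print("correctness:", corr)
    print("ambiguous indices:", select_ambiguous(var, fraction=1 / 3))
```

In this toy input the middle sample fluctuates most between epochs, so it is the one returned as ambiguous; a subset selected this way could then be used as the training set in place of the full dataset.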

Paper Structure

This paper contains 5 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Data map for the AuTexTification train dataset, based on our CNN classifier presented in \ref{sec:Model}.
  • Figure 2: Density plots for the three dimensions.
  • Figure 3: Length distribution for human and generated text. The x-axis shows the length in characters of each example and the y-axis shows the number of examples. The two classes do not differ in character length.