Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions

Madhav S. Baidya, S. S. Baidya, Chirag Chawla

Abstract

The rapid proliferation of large language models (LLMs) has created an urgent need for robust and generalizable detectors of machine-generated text. Existing benchmarks typically evaluate a single detector on a single dataset under ideal conditions, leaving open questions about cross-domain transfer, cross-LLM generalization, and adversarial robustness. We present a comprehensive benchmark evaluating diverse detection approaches across two corpora: HC3 (23,363 human-ChatGPT pairs) and ELI5 (15,000 human-Mistral-7B pairs). Methods include classical classifiers, fine-tuned transformer encoders (BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa-v3), a CNN, an XGBoost stylometric model, perplexity-based detectors, and LLM-as-detector prompting. Results show that transformer models achieve near-perfect in-distribution performance but degrade under domain shift. The XGBoost stylometric model matches this in-distribution performance while remaining interpretable. LLM-based detectors underperform and are affected by generator-detector identity bias. Perplexity-based methods exhibit polarity inversion: modern LLM outputs show lower perplexity than human text, so the score direction must be flipped, after which the methods remain effective. No method generalizes robustly across domains and LLM sources.
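The polarity inversion mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `auroc` and `polarity_corrected`, the label convention (1 = LLM text, 0 = human text), and the toy perplexity values are all assumptions made for illustration.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def polarity_corrected(perplexities, labels):
    """Return (sign, auroc); sign == -1 means LOWER perplexity indicates LLM text."""
    raw = auroc(perplexities, labels)  # assumes "higher perplexity => LLM"
    if raw >= 0.5:
        return 1, raw
    # Inverted polarity: negate the score so that higher means "more LLM-like".
    return -1, auroc([-p for p in perplexities], labels)


# Toy values invented for illustration only (assumed labels: 1 = LLM, 0 = human).
ppl = [12.0, 15.0, 14.0, 40.0, 55.0, 33.0]
y = [1, 1, 1, 0, 0, 0]
sign, corrected = polarity_corrected(ppl, y)
print(sign, corrected)
```

In practice the sign would be chosen on a held-out validation split rather than on the test data, so that the correction does not leak label information into the reported score.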

Paper Structure (77 sections, 7 equations, 16 figures, 39 tables)

Figures (16)

  • Figure 1: Overview of the benchmark pipeline. Stage 0 constructs two paired corpora (HC3: 23k human-ChatGPT pairs; ELI5: 15k human-Mistral-7B pairs) with length-matched preprocessing. Stage 1 evaluates three detector families: Family 1 (classical statistical classifiers), Family 2 (fine-tuned encoder transformers: BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa-v3; a 1D-CNN; perplexity-based detectors; a stylometric-hybrid XGBoost model), and Family 3 (LLM-as-detector prompting at four scales, including GPT-4o-mini). Stage 2 evaluates cross-LLM generalization via neural detectors, embedding-space classifier matrices, and distributional shift analysis. Stage 3 applies adversarial humanization at three levels (L0-L2) using an instruction-tuned rewriter. All families are evaluated under a unified five-metric suite (AUROC, AUPRC, EER, Brier score, FPR@95%TPR).
  • Figure 2: Calibration curves for classical detectors across four evaluation settings. Points close to the diagonal indicate well-calibrated confidence scores, while systematic deviations reflect over- or under-confidence.
  • Figure 3: Performance analysis of DistilBERT across four evaluation conditions. Top: score distributions indicating class separability. Middle: reliability diagrams assessing calibration. Bottom: ROC curves illustrating discrimination performance. DistilBERT achieves performance close to the larger transformer encoders at approximately 60% of BERT's parameter count.
  • Figure 4: Training dynamics and detectability behavior of the 1D-CNN detector. Top: rapid convergence to high validation AUC on both datasets. Bottom: score distributions indicating strong separability between human and LLM text.
  • Figure 5: 1D-CNN degradation curve under progressive text humanization. The $x$-axis represents the fraction of human tokens mixed into otherwise LLM-generated text. The steep, smooth decline confirms that the 1D-CNN is highly sensitive to even small amounts of human-style $n$-gram patterns.
  • ...and 11 more figures
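
The five-metric suite named in the Figure 1 caption (AUROC, AUPRC, EER, Brier score, FPR@95%TPR) could be computed along the following lines. This is a minimal standard-library sketch, not the benchmark's own code: the function names, the label convention (1 = LLM text, 0 = human text), the assumption that scores are probabilities in [0, 1], the nearest-threshold approximation used for EER, and the toy inputs are all assumptions.

```python
def _sweep(scores, labels):
    """Yield cumulative (tp, fp) after each distinct threshold, from high score to low."""
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:  # group tied scores
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        yield tp, fp
        i = j


def five_metrics(scores, labels, target_tpr=0.95):
    """Assumes labels in {0, 1} with both classes present and scores in [0, 1]."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    counts = list(_sweep(scores, labels))
    roc = [(0.0, 0.0)] + [(fp / n_neg, tp / n_pos) for tp, fp in counts]

    # AUROC: trapezoidal area under the (FPR, TPR) curve.
    auroc = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(roc, roc[1:]))

    # AUPRC as average precision: sum of recall increments times precision.
    auprc, prev_recall = 0.0, 0.0
    for tp, fp in counts:
        recall = tp / n_pos
        auprc += (recall - prev_recall) * (tp / (tp + fp))
        prev_recall = recall

    # EER (approximation): threshold where FPR and FNR are closest, averaged.
    fpr_e, tpr_e = min(roc, key=lambda p: abs(p[0] - (1 - p[1])))
    eer = (fpr_e + (1 - tpr_e)) / 2

    # Brier score: mean squared error between predicted probability and label.
    brier = sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)

    # FPR@95%TPR: lowest false-positive rate among thresholds reaching the target TPR.
    fpr_at_tpr = min(f for f, t in roc if t >= target_tpr)

    return {"auroc": auroc, "auprc": auprc, "eer": eer,
            "brier": brier, "fpr_at_95_tpr": fpr_at_tpr}


# Toy detector outputs invented for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = five_metrics(scores, labels)
print(metrics)
```

AUROC and AUPRC depend only on the ranking of scores, whereas the Brier score also penalizes miscalibrated probabilities, which is one reason to report discrimination and calibration metrics together.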