PADBen: A Comprehensive Benchmark for Evaluating AI Text Detectors Against Paraphrase Attacks

Yiwei Zha, Rui Min, Shanu Sushmita

TL;DR

PADBen tackles the robustness of AI text detectors against paraphrase attacks, with a focus on iterative laundering. It introduces a five-type text taxonomy and five detection tasks across sentence-level and pairwise formats. A dual representation space analysis reveals an intermediate laundering region where iterative paraphrasing causes semantic drift while preserving generation patterns, enabling authorship obfuscation and plagiarism evasion. Evaluating 11 detectors across zero-shot and model-based families shows a pronounced asymmetry: plagiarism evasion remains detectable while authorship obfuscation drives near-random performance for source attribution, underscoring the need for new detector architectures beyond current semantic and stylistic cues. Code and benchmarks are publicly available at the PADBen GitHub repository.

Abstract

While AI-generated text (AIGT) detectors achieve over 90% accuracy on direct LLM outputs, they fail catastrophically against iteratively paraphrased content. We investigate why iteratively paraphrased text, itself AI-generated, evades detection systems designed for AIGT identification. Through intrinsic mechanism analysis, we reveal that iterative paraphrasing creates an intermediate laundering region characterized by semantic displacement with preserved generation patterns, which gives rise to two attack categories: paraphrasing human-authored text (authorship obfuscation) and paraphrasing LLM-generated text (plagiarism evasion). To address these vulnerabilities, we introduce PADBen, the first benchmark systematically evaluating detector robustness against both paraphrase attack scenarios. PADBen comprises a five-type text taxonomy capturing the full trajectory from original content to deeply laundered text, and five progressive detection tasks across sentence-pair and single-sentence challenges. We evaluate 11 state-of-the-art detectors, revealing a critical asymmetry: detectors successfully identify plagiarism evasion but fail on authorship obfuscation. Our findings demonstrate that current detection approaches cannot effectively handle the intermediate laundering region, necessitating fundamental advances in detection architectures beyond existing semantic and stylistic discrimination methods. For the detailed code implementation, please see https://github.com/JonathanZha47/PadBen-Paraphrase-Attack-Benchmark.

Paper Structure

This paper contains 94 sections, 8 equations, 9 figures, 9 tables, 6 algorithms.

Figures (9)

  • Figure 1: Overall pipeline for benchmark curation. Preprocessing details are in Appendix \ref{Appendix_A}; data generation is in Appendix \ref{appC:data_generation}.
  • Figure 2: Overview of the benchmark tasks. Tasks 1-5 measure different detector capabilities, covering robustness and performance under paraphrase attacks. Task-specific details can be found in Appendix \ref{appC:task_intro}.
  • Figure 3: Two evaluation challenges: single-sentence classification and sentence-pair recognition. All five tasks are transformed into these two challenge formats. Detailed setup is provided in Appendix \ref{appD:evaluation_setup}.
  • Figure 4: The integration of the HLPC, MRPC, and PAWS datasets follows a systematic pipeline that encompasses data loading, standardization, quality control, and deduplication. This approach ensures data integrity while maximizing the utility of each source dataset.
  • Figure 5: PCA projection of semantic space (left) and K-means clustering results (right, $k=3$). Despite measurable distance differences (Table \ref{tab:pairwise_distances}), text categories show substantial overlap in the 2D projection, indicating that distinguishing information exists in higher dimensions beyond the principal components.
  • ...and 4 more figures
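
The Figure 4 caption names four integration steps (data loading, standardization, quality control, deduplication) without giving details. The following is a minimal sketch of how the last three steps could be chained, not the authors' implementation. The `PairRecord` type, its field names, the `min_words` threshold, and the specific quality rule are all assumptions introduced for illustration.

```python
import re
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class PairRecord:
    """Hypothetical record for one sentence pair (not the benchmark's schema)."""
    source: str   # assumed tag for the originating dataset, e.g. "MRPC"
    text_a: str
    text_b: str


def standardize(text: str) -> str:
    """Normalize Unicode to NFKC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def passes_quality(record: PairRecord, min_words: int = 3) -> bool:
    """Assumed quality rule: both sides long enough and not identical."""
    a, b = record.text_a, record.text_b
    if len(a.split()) < min_words or len(b.split()) < min_words:
        return False
    return a.casefold() != b.casefold()


def integrate(records):
    """Standardize, filter, then deduplicate, keeping the first occurrence."""
    seen = set()
    kept = []
    for rec in records:
        clean = PairRecord(rec.source,
                           standardize(rec.text_a),
                           standardize(rec.text_b))
        if not passes_quality(clean):
            continue
        # Case-insensitive key so trivially re-cased pairs count as duplicates.
        key = (clean.text_a.casefold(), clean.text_b.casefold())
        if key in seen:
            continue
        seen.add(key)
        kept.append(clean)
    return kept
```

Deduplicating after standardization means that pairs differing only in spacing or letter case collapse to a single entry, which is one plausible way to avoid leakage when several paraphrase corpora share sentences.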