Table of Contents
Fetching ...

DAMAGE: Detecting Adversarially Modified AI Generated Text

Elyas Masrour, Bradley Emi, Max Spero

TL;DR

This work addresses the vulnerability of AI-generated text detectors to humanizers that rewrite content to evade detection. It introduces DAMAGE, a detector trained with data-centric augmentation to learn invariances to both human and machine paraphrasing, achieving strong performance on standard AI text and RAID-style adversarial attacks. The study analyzes humanizer transformation patterns, demonstrates that even detector-specific adversarial humanization cannot fully erase detectable cues, and shows the detector generalizes across unseen tools. The findings have practical implications for deploying reliable AI-text detectors in education, SEO, and content moderation while suggesting pathways for further strengthening detector robustness.

Abstract

AI humanizers are a new class of online software tools meant to paraphrase and rewrite AI-generated text in a way that allows them to evade AI detection software. We study 19 AI humanizer and paraphrasing tools and qualitatively assess their effects and faithfulness in preserving the meaning of the original text. We show that many existing AI detectors fail to detect humanized text. Finally, we demonstrate a robust model that can detect humanized AI text while maintaining a low false positive rate using a data-centric augmentation approach. We attack our own detector, training our own fine-tuned model optimized against our detector's predictions, and show that our detector's cross-humanizer generalization is sufficient to remain robust to this attack.

DAMAGE: Detecting Adversarially Modified AI Generated Text

TL;DR

This work addresses the vulnerability of AI-generated text detectors to humanizers that rewrite content to evade detection. It introduces DAMAGE, a detector trained with data-centric augmentation to learn invariances to both human and machine paraphrasing, achieving strong performance on standard AI text and RAID-style adversarial attacks. The study analyzes humanizer transformation patterns, demonstrates that even detector-specific adversarial humanization cannot fully erase detectable cues, and shows the detector generalizes across unseen tools. The findings have practical implications for deploying reliable AI-text detectors in education, SEO, and content moderation while suggesting pathways for further strengthening detector robustness.

Abstract

AI humanizers are a new class of online software tools meant to paraphrase and rewrite AI-generated text in a way that allows them to evade AI detection software. We study 19 AI humanizer and paraphrasing tools and qualitatively assess their effects and faithfulness in preserving the meaning of the original text. We show that many existing AI detectors fail to detect humanized text. Finally, we demonstrate a robust model that can detect humanized AI text while maintaining a low false positive rate using a data-centric augmentation approach. We attack our own detector, training our own fine-tuned model optimized against our detector's predictions, and show that our detector's cross-humanizer generalization is sufficient to remain robust to this attack.
Paper Structure (39 sections, 7 figures, 9 tables)

This paper contains 39 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Example of an AI humanizer tool
  • Figure 2: Augmenting the training set with high quality humanizer data improves robustness.
  • Figure 3: Two out of the four most popular Writing Custom GPTs are Humanizers
  • Figure 4: This paraphraser performs a very close paraphrase, only replacing individual words and phrases rather than rewriting entire sentences and paragraphs.
  • Figure 5: We segment humanizers into three tiers, based on their fluency.
  • ...and 2 more figures