Mitigating Paraphrase Attacks on Machine-Text Detectors via Paraphrase Inversion

Rafael Rivera Soto, Barry Chen, Nicholas Andrews

TL;DR

This work addresses the challenge that paraphrase attacks degrade machine-text detectors by introducing paraphrase inversion as a detector-agnostic defense. It frames inversion as translating paraphrased text back to the original, trained on paired data from original corpora and paraphrasers, and demonstrates generalization to unseen paraphrasers. The authors propose two paraphrase detection schemes and an end-to-end inversion model based on Mistral-7B, achieving an average +22% AUROC gain across seven detectors and three domains. The approach generalizes across domains and paraphrasers, offering a practical defense to improve robustness of machine-text detectors in real-world settings.

Abstract

High-quality paraphrases are easy to produce using instruction-tuned language models or specialized paraphrasing models. Although this capability has a variety of benign applications, paraphrasing attacks–paraphrases applied to machine-generated texts–are known to significantly degrade the performance of machine-text detectors. This motivates us to consider the novel problem of paraphrase inversion, where, given paraphrased text, the objective is to recover an approximation of the original text. The closer the approximation is to the original text, the better machine-text detectors will perform. We propose an approach which frames the problem as translation from paraphrased text back to the original text, which requires examples of texts and corresponding paraphrases to train the inversion model. Fortunately, such training data can easily be generated, given a corpus of original texts and one or more paraphrasing models. We find that language models such as GPT-4 and Llama-3 exhibit biases when paraphrasing which an inversion model can learn with a modest amount of data. Perhaps surprisingly, we also find that such models generalize well, including to paraphrase models unseen at training time. Finally, we show that when combined with a paraphrased-text detector, our inversion models provide an effective defense against paraphrasing attacks, and overall our approach yields an average improvement of +22% AUROC across seven machine-text detectors and three different domains.
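
The abstract notes that training data for the inversion model can be generated from a corpus of original texts and one or more paraphrasing models. A minimal sketch of that pairing step is shown below. It is an illustration rather than the authors' code: `build_inversion_pairs`, the record field names, and `toy_paraphraser` are all hypothetical, and a real setup would call an actual paraphrasing model in place of the toy function.

```python
from typing import Callable, Dict, Iterable, List


def build_inversion_pairs(
    originals: Iterable[str],
    paraphrasers: Dict[str, Callable[[str], str]],
) -> List[dict]:
    """Pair each paraphrase (model input) with its original text (target).

    Hypothetical helper: the inversion task is framed as translation from
    the paraphrase ("source") back to the original ("target").
    """
    pairs = []
    for text in originals:
        for name, paraphrase in paraphrasers.items():
            pairs.append(
                {
                    "paraphraser": name,         # which model produced the paraphrase
                    "source": paraphrase(text),  # what the inversion model reads
                    "target": text,              # what it should reconstruct
                }
            )
    return pairs


# Toy stand-in for a paraphrasing model (assumption, for illustration only).
def toy_paraphraser(text: str) -> str:
    return text.replace("quick", "fast")


pairs = build_inversion_pairs(["the quick fox"], {"toy": toy_paraphraser})
```

Keeping the paraphraser name on each record makes it straightforward to hold out one paraphraser entirely at evaluation time, which is how generalization to unseen paraphrasers could be checked.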

Paper Structure

This paper contains 48 sections, 2 figures, 10 tables.

Figures (2)

  • Figure 1: Paraphrasing defeats machine-text detection systems. Our proposed defense (see the methods section) consists of two steps: (1) detecting whether text is a paraphrase, and (2) if so, inverting the paraphrase back to the original text. This pipeline improves the AUROC of seven machine-text detectors across three domains by an average of +22% (see the machine-text detection table).
  • Figure 2: Edit distances between the original text and its inversion when the machine-paraphrase inversion model is applied to human-written text and to paraphrases of human- or machine-written text. The inversion model edits human-written text significantly less.
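
The two-step defense described in the Figure 1 caption (detect whether text is a paraphrase, and if so invert it before scoring) can be sketched as follows. This is only a sketch under stated assumptions: `defended_score` and the three callables it takes are hypothetical placeholders, and the toy stand-ins below exist purely to make the example runnable; they are not the paper's detectors or inversion model.

```python
from typing import Callable


def defended_score(
    text: str,
    is_paraphrase: Callable[[str], bool],
    invert: Callable[[str], str],
    detector_score: Callable[[str], float],
) -> float:
    """Score text with a machine-text detector, inverting it first if flagged.

    Step 1: a paraphrase detector decides whether the text looks paraphrased.
    Step 2: if so, the inversion model approximates the original text.
    The (possibly inverted) text is then passed to any machine-text detector.
    """
    candidate = invert(text) if is_paraphrase(text) else text
    return detector_score(candidate)


# Toy stand-ins (assumptions, for illustration only).
def toy_is_paraphrase(text: str) -> bool:
    return "fast" in text


def toy_invert(text: str) -> str:
    return text.replace("fast", "quick")


def toy_detector(text: str) -> float:
    return 1.0 if "quick" in text else 0.0


attacked = defended_score("the fast fox", toy_is_paraphrase, toy_invert, toy_detector)
untouched = defended_score("a slow fox", toy_is_paraphrase, toy_invert, toy_detector)
```

Because the detector is passed in as a plain callable, the same wrapper applies to any machine-text detector, which matches the detector-agnostic framing in the TL;DR.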