Mitigating Paraphrase Attacks on Machine-Text Detectors via Paraphrase Inversion
Rafael Rivera Soto, Barry Chen, Nicholas Andrews
TL;DR
Paraphrase attacks degrade machine-text detectors. This work introduces paraphrase inversion as a detector-agnostic defense: inversion is framed as translating paraphrased text back to the original, with the model trained on paired data generated from original corpora and paraphrasers. The authors propose two paraphrase detection schemes and an end-to-end inversion model based on Mistral-7B, achieving an average +22% AUROC gain across seven detectors and three domains. The approach generalizes across domains and to paraphrasers unseen at training time, offering a practical way to improve the robustness of machine-text detectors in real-world settings.
Abstract
High-quality paraphrases are easy to produce using instruction-tuned language models or specialized paraphrasing models. Although this capability has a variety of benign applications, paraphrasing attacks (paraphrases applied to machine-generated texts) are known to significantly degrade the performance of machine-text detectors. This motivates us to consider the novel problem of paraphrase inversion, where, given paraphrased text, the objective is to recover an approximation of the original text. The closer the approximation is to the original text, the better machine-text detectors will perform. We propose an approach which frames the problem as translation from paraphrased text back to the original text, which requires examples of texts and corresponding paraphrases to train the inversion model. Fortunately, such training data can easily be generated, given a corpus of original texts and one or more paraphrasing models. We find that language models such as GPT-4 and Llama-3 exhibit biases when paraphrasing which an inversion model can learn with a modest amount of data. Perhaps surprisingly, we also find that such models generalize well, including to paraphrase models unseen at training time. Finally, we show that when combined with a paraphrased-text detector, our inversion models provide an effective defense against paraphrasing attacks, and overall our approach yields an average improvement of +22% AUROC across seven machine-text detectors and three different domains.
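The two mechanics described in the abstract, building (paraphrase, original) training pairs and gating inversion behind a paraphrased-text detector, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: every function name here is hypothetical, and the toy word-swap paraphraser, inverter, and detectors are stand-ins assumed purely so the example runs without any model.

```python
from typing import Callable, Iterable, List, Tuple

Paraphraser = Callable[[str], str]


def build_inversion_pairs(
    originals: Iterable[str], paraphrasers: List[Paraphraser]
) -> List[Tuple[str, str]]:
    """Build translation-style training pairs.

    The source side is the paraphrase and the target side is the original,
    so a sequence-to-sequence model trained on these pairs learns to invert.
    """
    pairs = []
    for text in originals:
        for paraphrase in paraphrasers:
            pairs.append((paraphrase(text), text))
    return pairs


def defended_score(
    text: str,
    is_paraphrased: Callable[[str], bool],
    invert: Callable[[str], str],
    detector_score: Callable[[str], float],
) -> float:
    """Invert only texts flagged as paraphrased, then run the detector."""
    candidate = invert(text) if is_paraphrased(text) else text
    return detector_score(candidate)


# --- Toy stand-ins (assumptions for illustration only) ---

def _swap_word(text: str, old: str, new: str) -> str:
    # Whole-word replacement on whitespace-separated tokens.
    return " ".join(new if w == old else w for w in text.split())


def toy_paraphraser(text: str) -> str:
    # Stands in for an LLM paraphraser with a consistent lexical bias.
    return _swap_word(text, "use", "utilize")


def toy_inverter(text: str) -> str:
    # Stands in for a trained inversion model that has learned that bias.
    return _swap_word(text, "utilize", "use")


def toy_paraphrase_detector(text: str) -> bool:
    return "utilize" in text.split()


def toy_machine_text_score(text: str) -> float:
    # Stands in for a machine-text detector keyed to the original wording.
    return 1.0 if "use" in text.split() else 0.0


if __name__ == "__main__":
    pairs = build_inversion_pairs(["we use models"], [toy_paraphraser])
    print(pairs)
    attacked = toy_paraphraser("we use models")
    print(toy_machine_text_score(attacked))
    print(defended_score(attacked, toy_paraphrase_detector,
                         toy_inverter, toy_machine_text_score))
```

In a realistic setting the paraphrasers would be models such as those named in the abstract, and the inverter would be a language model fine-tuned on the resulting pairs; the gating step leaves unparaphrased inputs untouched.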
