Detection Avoidance Techniques for Large Language Models
Sinclair Schneider, Florian Steuber, Joao A. G. Schneider, Gabi Dreo Rodosek
TL;DR
The paper systematically evaluates how state-of-the-art detectors for LLM-generated text can be bypassed. It demonstrates vulnerability on three fronts: (1) shallow detectors, via changes to generation hyperparameters such as the sampling temperature; (2) transformer-based detectors, via reinforcement learning that fine-tunes the generator to evade the classifier; and (3) zero-shot detectors, via paraphrasing that preserves meaning while evading detection. Evasion rates exceed 90% against zero-shot detectors such as DetectGPT, which challenges current detection approaches and underscores the need for robust, adaptive defenses such as watermarking and regulatory measures. The work highlights significant societal implications, including the potential spread of misinformation, and emphasizes the ongoing arms race between detectors and attackers in the domain of LLM-generated content.
Abstract
The increasing popularity of large language models has not only led to widespread use but has also brought various risks, including the potential for systematically spreading fake news. Consequently, the development of classification systems such as DetectGPT has become vital. These detectors are vulnerable to evasion techniques, as demonstrated in a series of experiments: systematic changes to the generative model's temperature proved shallow-learning detectors to be the least reliable. Fine-tuning the generative model via reinforcement learning circumvented BERT-based detectors. Finally, rephrasing led to more than 90% evasion of zero-shot detectors like DetectGPT, although the texts stayed highly similar to the original. A comparison with existing work highlights the better performance of the presented methods. Possible implications for society and further research are discussed.
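To make the role of temperature concrete, the following is a minimal sketch of temperature-scaled softmax sampling, the standard mechanism by which a temperature setting reshapes a model's next-token distribution. It is not the paper's code: the function names and the example logits are hypothetical and chosen only for illustration. Raising the temperature flattens the distribution and increases its entropy, which changes the token statistics that a shallow detector may rely on.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher values mean a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token scores, not taken from any real model.
example_logits = [4.0, 2.0, 1.0, 0.5]

for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(example_logits, t)
    print(f"T={t}: top-token p={probs[0]:.3f}, entropy={entropy(probs):.3f}")
```

In this sketch, the probability of the most likely token falls and the entropy rises as the temperature increases, so generated text draws more often from less likely tokens.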
