Detection Avoidance Techniques for Large Language Models

Sinclair Schneider, Florian Steuber, Joao A. G. Schneider, Gabi Dreo Rodosek

TL;DR

The paper systematically evaluates how state-of-the-art detectors for LLM-generated text can be bypassed. It demonstrates vulnerability across three fronts: (1) shallow detectors via hyperparameter tweaks such as temperature and sampling, (2) transformer-based detectors through reinforcement learning that trains generators to evade classifiers, and (3) zero-shot detectors using carefully crafted paraphrasing to preserve meaning while evading detection. The findings show evasion rates can exceed 90% in some setups, challenging current detection approaches and underscoring the need for robust, adaptive defenses such as watermarking and regulatory measures. The work highlights significant societal implications, including the potential spread of misinformation, and emphasizes the ongoing arms race between detectors and attackers in the domain of LLM-generated content.

Abstract

The increasing popularity of large language models has not only led to widespread use but has also brought various risks, including the potential for systematically spreading fake news. Consequently, the development of classification systems such as DetectGPT has become vital. These detectors are vulnerable to evasion techniques, as demonstrated in a series of experiments: Systematic changes to the generative models' temperature proved shallow-learning detectors to be the least reliable. Fine-tuning the generative model via reinforcement learning circumvented BERT-based detectors. Finally, rephrasing led to a >90% evasion rate against zero-shot detectors like DetectGPT, although the texts stayed highly similar to the original. A comparison with existing work highlights the better performance of the presented methods. Possible implications for society and further research are discussed.
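
The abstract names the sampling temperature as the parameter that was varied against shallow-learning detectors. As a rough illustration of what that parameter does in general (not the authors' experimental code), the sketch below applies standard temperature scaling to a softmax over next-token scores. The logit values and the function names `softmax_with_temperature` and `entropy` are assumptions made up for this example; the paper's own models, temperatures, and detectors are not reproduced here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token scores, chosen only for illustration.
logits = [4.0, 2.0, 1.0, 0.5]

for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top-token p={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

Raising the temperature flattens the distribution, so less likely tokens are sampled more often. This shifts the word-frequency statistics of the generated text, which is the kind of signal a shallow detector would plausibly rely on.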

Paper Structure

This paper contains 61 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Data Pipeline used for Modeling
  • Figure 2: Human-based against machine-based Word Probability Distributions
  • Figure 3: Detection Rates by Temperature for Sampling Sizes, Methods, and Generator Models
  • Figure 4: Reinforcement Learning Reward Calculation Procedure
  • Figure 5: Ground-truth Distributions
  • ...and 3 more figures