OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples

Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki

TL;DR

OUTFOX tackles the challenge of robust LLM-generated-text detection by enabling a detector and an attacker to learn from each other through in-context prompts, producing adversarial essays that challenge detection without updating model parameters. The approach yields state-of-the-art performance on non-attacked texts (up to $F1$ = 96.9) and substantially improves robustness against attacker-generated texts (up to +41.3 $F1$-points), while the attacker remains capable of causing large degradation (up to -57.0 $F1$-points). A 15,400-triplet student-essay dataset was built to evaluate the framework, including attacked variants and multiple generation models. The work demonstrates that detectors can generalize to unseen attacker strategies through adversarial in-context examples, with potential impact on education and content authenticity, and it motivates further exploration of adversarially-aware detection in other domains such as fake news.

Abstract

Large Language Models (LLMs) have achieved human-level fluency in text generation, making it difficult to distinguish between human-written and LLM-generated texts. This poses a growing risk of misuse of LLMs and demands the development of detectors to identify LLM-generated texts. However, existing detectors lack robustness against attacks: their detection accuracy degrades when LLM-generated texts are simply paraphrased. Furthermore, a malicious user might deliberately try to evade detectors based on their detection results, a scenario that previous studies have not considered. In this paper, we propose OUTFOX, a framework that improves the robustness of LLM-generated-text detectors by allowing both the detector and the attacker to consider each other's output. In this framework, the attacker uses the detector's prediction labels as examples for in-context learning and adversarially generates essays that are harder to detect, while the detector uses the adversarially generated essays as examples for in-context learning to learn to detect essays from a strong attacker. Experiments in the domain of student essays show that the proposed detector improves detection performance on attacker-generated texts by up to +41.3 F1 points. Furthermore, the proposed detector achieves state-of-the-art detection performance on non-attacked texts, with an F1-score of up to 96.9, outperforming existing detectors. Finally, the proposed attacker degrades the performance of detectors by up to -57.0 F1 points, far more than the baseline paraphrasing method for evading detection.
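
The mutual in-context learning described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the prompt wording, the label strings "Human" and "LM", the function names, and the `llm` callable (any function mapping a prompt string to a completion string) are all assumptions introduced here.

```python
from typing import Callable, List, Tuple

# (essay_text, label) pairs. The label strings "Human" / "LM" are an assumption.
Example = Tuple[str, str]
LLM = Callable[[str], str]  # hypothetical: prompt in, completion out


def build_detector_prompt(examples: List[Example], target_essay: str) -> str:
    """Format labeled essays as in-context examples, then ask about the target."""
    parts = ["Classify each essay as written by a Human or an LM."]
    for text, label in examples:
        parts.append(f"Essay: {text}\nAnswer: {label}")
    parts.append(f"Essay: {target_essay}\nAnswer:")
    return "\n\n".join(parts)


def build_attacker_prompt(examples: List[Example], problem: str) -> str:
    """Show essays with the detector's predictions, then request an essay
    that the detector would label as human-written."""
    parts = ["Here are essays together with a detector's predictions."]
    for text, predicted in examples:
        parts.append(f"Essay: {text}\nDetector prediction: {predicted}")
    parts.append(
        "Write an essay for the problem below that the detector would "
        f"predict as Human.\nProblem: {problem}\nEssay:"
    )
    return "\n\n".join(parts)


def detect(llm: LLM, examples: List[Example], essay: str) -> str:
    """Return 'LM' or 'Human' based on the model's completion."""
    answer = llm(build_detector_prompt(examples, essay)).strip()
    return "LM" if answer.upper().startswith("LM") else "Human"


def attack(llm: LLM, detector_examples: List[Example],
           train_essays: List[str], problem: str) -> str:
    """Label training essays with the detector, then feed those predictions
    to the attacker as in-context examples."""
    predictions = [(e, detect(llm, detector_examples, e)) for e in train_essays]
    return llm(build_attacker_prompt(predictions, problem)).strip()
```

In this sketch, an essay returned by `attack` would then be appended to the detector's example list with the label "LM", so that the detector sees adversarial essays as in-context examples when it classifies test essays. No model parameters are updated at any point; only the prompts change.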

Paper Structure (26 sections, 4 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: The OUTFOX framework consists of three steps. Step ①: The detector assigns prediction labels to texts in a training set. Step ②: The attacker uses the detector's prediction labels as examples for in-context learning to generate more sophisticated attacks on the training set. Step ③: The detector uses these adversarially generated texts from the strong attacker as examples for in-context learning to detect texts in a test set.
  • Figure 2: An illustration of our OUTFOX detector: The detector utilizes the adversarially generated essays as examples for in-context learning to learn to detect essays from our OUTFOX attacker.
  • Figure 3: An illustration of our OUTFOX attacker: The attacker considers our OUTFOX detector's prediction labels as examples for in-context learning and adversarially generates essays that are harder to detect.
  • Figure 4: Distributions of cosine similarity to human-written essays for non-attacked essays and for essays generated by our OUTFOX attacker.
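
For readers unfamiliar with the measure in Figure 4, the following is a minimal sketch of cosine similarity between two essays. The text representation used in the paper is not stated in this section, so the bag-of-words count vectors below are an assumption chosen only to keep the example self-contained; the function name is likewise hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts using word-count vectors (an assumption)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)
```

Computing this score between each generated essay and a human-written essay, then plotting a histogram for each group, would give distributions of the kind the caption describes.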