LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts

Henrique Da Silva Gameiro, Andrei Kucharavy, Ljiljana Dolamic

TL;DR

The study interrogates whether LLM detectors can reliably identify LLM-generated short news-like posts under realistic threat models. It introduces a dynamic, domain-specific benchmarking framework with six generator datasets, adversarial attacks, and tests on unseen human text, revealing that zero-shot detectors are highly vulnerable to simple evasion tactics while a custom detector can generalize across LLMs but overfits to human text, limiting real-world applicability. The findings challenge the efficacy of existing benchmarks and underscore the need for application-specific, adaptable evaluation to accurately gauge detector readiness. By releasing an extensible benchmark repository, the work aims to reorient detector evaluation toward domain-relevant, dynamically extendable testing to better mitigate real-world misinformation risks.

Abstract

With the emergence of widely available powerful LLMs, disinformation generated by Large Language Models (LLMs) has become a major concern. Historically, LLM detectors have been touted as a solution, but their effectiveness in the real world is still to be proven. In this paper, we focus on an important setting in information operations -- short news-like posts generated by moderately sophisticated attackers. We demonstrate that existing LLM detectors, whether zero-shot or purpose-trained, are not ready for real-world use in that setting. All tested zero-shot detectors perform inconsistently with prior benchmarks and are highly vulnerable to sampling temperature increase, a trivial attack absent from recent benchmarks. A purpose-trained detector generalizing across LLMs and unseen attacks can be developed, but it fails to generalize to new human-written texts. We argue that the former indicates domain-specific benchmarking is needed, while the latter suggests a trade-off between adversarial evasion resilience and overfitting to the reference human text; both need to be evaluated in benchmarks, and both are currently absent from them. We believe this calls for a reconsideration of current LLM detector benchmarking approaches, and we provide a dynamically extensible benchmark to enable it (https://github.com/Reliable-Information-Lab-HEVS/benchmark_llm_texts_detection).

Paper Structure (57 sections, 13 figures, 7 tables)

Figures (13)

  • Figure 1: Detector TPR when tested on the test set of the dataset they were trained on. "round_robin" represents a mixture of all the other datasets. Roberta is the short form of RoBERTa-Large, Electra of Electra-Large, and distil of Distil-RoBERTa-base. The TPR is for a target FPR of at most 5%.
  • Figure 2: Detector TPR for the different Electra-Large finetuned models when testing them with datasets generated by different LLMs. Gemma is Gemma-2B, Phi is Phi-2, and Round-Robin is a mixture of gemma, mistral, and phi. Gemma_chat is gemma_2B-it, Llama3 is Llama3-Instruct-8B, and Zephyr is Zephyr-7B-Beta. The TPR is for a target FPR of at most 5%.
  • Figure 3: TPR comparison of detectors tested on the dataset generated by the chat models (upper part) and non-chat models (bottom part). Electra is Electra-Large finetuned on Mistral samples, and roberta_open_ai is the RoBERTa detector released by OpenAI. The TPR is for a target FPR of at most 5%.
  • Figure 4: News prompt used to generate news articles resembling CNN news.
  • Figure 5: Tweet prompt used to generate news information in a tweet format.
  • ...and 8 more figures
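
The captions above report TPR "for a target FPR of at most 5%". As a minimal sketch of what such a metric can look like, the snippet below picks the lowest decision threshold whose false positive rate on human-written texts stays within the budget, then reports the true positive rate on LLM-generated texts at that threshold. The function name `tpr_at_max_fpr`, the convention that a higher score means "more likely LLM-generated", and the toy scores are all assumptions made for illustration; this is not the paper's evaluation code.

```python
from typing import Sequence, Tuple


def tpr_at_max_fpr(
    human_scores: Sequence[float],
    llm_scores: Sequence[float],
    max_fpr: float = 0.05,
) -> Tuple[float, float, float]:
    """Return (threshold, fpr, tpr) for the best threshold within the FPR budget.

    Assumption: a text is flagged as LLM-generated when its score >= threshold.
    """
    if not human_scores or not llm_scores:
        raise ValueError("both score lists must be non-empty")

    # Candidate thresholds: every observed score, plus +inf (flag nothing),
    # which guarantees that at least one candidate satisfies the FPR budget.
    candidates = sorted(set(human_scores) | set(llm_scores)) + [float("inf")]

    for threshold in candidates:
        # FPR: fraction of human-written texts wrongly flagged.
        fpr = sum(s >= threshold for s in human_scores) / len(human_scores)
        if fpr <= max_fpr:
            # FPR never increases as the threshold rises, so the first
            # qualifying threshold is the lowest one and maximizes TPR.
            tpr = sum(s >= threshold for s in llm_scores) / len(llm_scores)
            return threshold, fpr, tpr

    raise RuntimeError("unreachable: +inf always yields an FPR of 0")


if __name__ == "__main__":
    # Toy, made-up detector scores (hypothetical, for illustration only).
    human = [0.1, 0.2, 0.3, 0.4, 0.9]
    llm = [0.35, 0.6, 0.8, 0.95]
    print(tpr_at_max_fpr(human, llm, max_fpr=0.05))
```

With only a handful of human samples, a single false positive already exceeds a 5% budget, so the threshold is pushed above the highest human score. This illustrates why the size and choice of the human reference set matter when reading TPR figures reported at a fixed FPR.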