SilverSpeak: Evading AI-Generated Text Detectors using Homoglyphs

Aldan Creo, Shushanta Pudasaini

TL;DR

SilverSpeak investigates homoglyph-based attacks for evading AI-generated text detectors. It systematically evaluates seven detectors across five datasets, applying random and greedy homoglyph replacements, and finds that the average Matthews Correlation Coefficient (MCC) drops from about $0.64$ to about $-0.01$, indicating near-complete evasion. The authors give a technical justification for each detector family: tokenization-induced loglikelihood shifts for perplexity-based detectors, embedding-space disruption for classifiers, and watermark fragility. They release code and datasets publicly and discuss ethical implications and safeguards, highlighting the need for more robust detection approaches.

Abstract

The advent of Large Language Models (LLMs) has enabled the generation of text that increasingly exhibits human-like characteristics. As the detection of such content is of significant importance, substantial research has been conducted with the objective of developing reliable AI-generated text detectors. These detectors have demonstrated promising results on test data, but recent research has revealed that they can be circumvented by employing different techniques. In this paper, we present homoglyph-based attacks (A $\rightarrow$ Cyrillic A) as a means of circumventing existing detectors. We conduct a comprehensive evaluation to assess the effectiveness of these attacks on seven detectors, including ArguGPT, Binoculars, DetectGPT, Fast-DetectGPT, Ghostbuster, OpenAI's detector, and watermarking techniques, on five different datasets. Our findings demonstrate that homoglyph-based attacks can effectively circumvent state-of-the-art detectors, leading them to classify all texts as either AI-generated or human-written (decreasing the average Matthews Correlation Coefficient from 0.64 to -0.01). Through further examination, we extract the technical justification underlying the success of the attacks, which varies across detectors. Finally, we discuss the implications of these findings and potential defenses against such attacks.
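As a rough illustration of the random-replacement attack described above, the following minimal sketch swaps a chosen fraction of characters for visually similar Cyrillic ones. The mapping table, the function name `random_homoglyph_attack`, and the `percentage` and `seed` parameters are assumptions made for this example; they are not taken from the authors' released code, and the greedy variant is not shown.

```python
import random

# Illustrative Latin -> Cyrillic look-alike map (an assumption for this
# sketch; the paper's actual homoglyph table may differ and be larger).
HOMOGLYPHS = {
    "A": "\u0410",  # CYRILLIC CAPITAL LETTER A
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
}


def random_homoglyph_attack(text, percentage, seed=0):
    """Replace about `percentage` of all characters with look-alikes.

    Only characters that have an entry in HOMOGLYPHS can be replaced, so
    the number of replacements is capped by the number of such characters.
    """
    rng = random.Random(seed)
    candidates = [i for i, ch in enumerate(text) if ch in HOMOGLYPHS]
    n_replace = min(len(candidates), int(len(text) * percentage))
    chars = list(text)
    for i in rng.sample(candidates, n_replace):
        chars[i] = HOMOGLYPHS[chars[i]]
    return "".join(chars)


original = "A peace accord"
attacked = random_homoglyph_attack(original, percentage=0.5)
# The two strings render almost identically but differ at the code-point level.
print(original == attacked, len(original) == len(attacked))
```

Because the replaced characters are different code points, a tokenizer will generally split the attacked string into different tokens than the original, which is the effect the paper's technical justification builds on.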

Paper Structure (20 sections, 2 equations, 29 figures, 21 tables)

Figures (29)

  • Figure 1: Homoglyph-based attack. The left box shows the original text, adapted from [hans2024spotting], and the right box shows the text after rewriting some of its characters. The bottom boxes show the tokenized versions from [openai-tokenizer]. Differences are shown in red.
  • Figure 2: Our experimental process. First, we generate a set of rewritten datasets by applying homoglyph-based attacks, with varying replacement percentages, on all original datasets. Then, we run the detectors on the original and attacked datasets to get the metrics presented.
  • Figure 3: Token loglikelihoods for the text in Figure 1 on BLOOM-560m [workshop2023bloom]. The attacked text (10% replacement) has a distribution shifted towards more negative values.
  • Figure 4: Embeddings from ArguGPT. While the original texts are well-separated, the embeddings of the attacked texts are mixed and placed in a different subspace.
  • Figure 5: Confusion matrices for the ArguGPT detector on the CHEAT dataset.
  • ...and 24 more figures
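
The Matthews Correlation Coefficient reported in the abstract can be computed directly from a binary confusion matrix such as those in Figure 5. The sketch below uses the standard MCC formula; the counts are hypothetical and chosen only to show why a detector that assigns every text to one class scores about zero. Returning 0 when the denominator vanishes is a common convention assumed here, not something the source specifies.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined MCC (zero denominator) is reported as 0.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for a detector that separates the classes reasonably well.
print(round(mcc(tp=45, tn=40, fp=10, fn=5), 3))

# Hypothetical counts for a detector that labels every text as AI-generated:
# no true negatives and no false negatives, so the score collapses to 0.
print(mcc(tp=50, tn=0, fp=50, fn=0))
```

An MCC of 1 means perfect agreement with the true labels, 0 means the predictions carry no information about them, and -1 means complete disagreement.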