Masks and Mimicry: Strategic Obfuscation and Impersonation Attacks on Authorship Verification

Kenneth Alperin, Rohan Leekha, Adaku Uchendu, Trang Nguyen, Srilakshmi Medarametla, Carlos Levya Capote, Seth Aycock, Charlie Dagli

TL;DR

The paper investigates the robustness of a high-performing authorship verification model (BigBird) against semantic-preserving adversarial attacks, introducing Authorship Obfuscation (untargeted) and Authorship Impersonation (targeted). It evaluates three obfuscators (PEGASUS, DIPPER, Mistral) and three impersonation techniques (Mistral+RAG, STRAP, custom prompts) across PAN20 FanFiction and CelebTwitter datasets, reporting high obfuscation ASR (up to ~92%) and meaningful impersonation ASR (up to ~78% in certain settings). The study provides extensive ablation analyses on the degree of obfuscation and the amount of in-context data needed for impersonation, highlighting domain-specific performance differences and the potential security implications for authorship detection. These findings motivate future defenses, including multilingual extensions, larger LLMs, and hybrid feature approaches to bolster AV systems against such attacks.

Abstract

The increasing use of Artificial Intelligence (AI) technologies, such as Large Language Models (LLMs), has led to nontrivial improvements in various tasks, including accurate authorship identification of documents. However, while LLMs improve such defense techniques, they also provide a vehicle for malicious actors to launch new attack vectors. To combat this security risk, we evaluate the adversarial robustness of authorship models (specifically an authorship verification model) to potent LLM-based attacks. These attacks include untargeted methods (authorship obfuscation) and targeted methods (authorship impersonation). The objective is to mask (obfuscation) or mimic (impersonation) the writing style of an author while preserving the original text's semantics. We attack an accurate authorship verification model and achieve maximum attack success rates of 92% and 78% for obfuscation and impersonation attacks, respectively.
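The abstract reports attack success rates (ASR) without defining the metric here. As a minimal sketch, assuming ASR is the fraction of eligible text pairs whose verifier decision flips to the attacker's desired outcome, it could be computed as below. The function name, the boolean encoding of verifier decisions, and the eligibility rule are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence


def attack_success_rate(
    before: Sequence[bool], after: Sequence[bool], target: bool
) -> float:
    """Hypothetical ASR: share of eligible pairs flipped to `target`.

    before / after: the verifier's "same author" decision for each text
    pair before and after the attack (True = same author).
    target: the decision the attacker wants. Assumed here to be False for
    obfuscation (hide the true author) and True for impersonation
    (pass as the target author).
    """
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    # Only pairs not already at the attacker's desired decision are eligible.
    eligible = [a for b, a in zip(before, after) if b != target]
    if not eligible:
        return 0.0
    return sum(a == target for a in eligible) / len(eligible)


# Obfuscation: three pairs were judged same-author; two flip after the attack.
obfuscation_asr = attack_success_rate(
    [True, True, True, False], [False, False, True, False], target=False
)

# Impersonation: three pairs were judged different-author; one flips.
impersonation_asr = attack_success_rate(
    [False, False, True, False], [True, False, True, False], target=True
)
```

Restricting the denominator to pairs the verifier initially got "right" from the defender's point of view is one common convention; other definitions divide by all attacked pairs, which would give lower values on the same data.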

Paper Structure

This paper contains 26 sections, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Illustration of Authorship Obfuscation (above) and Authorship Impersonation (below)
  • Figure 2: Mistral and RAG framework for Authorship Impersonation. See the Appendix for a more detailed description of the pipeline with prompts
  • Figure 3: Attack Success Rate (ASR) vs. Semantics
  • Figure 4: Density Plot of Attack Success Rates for Mistral and Mixtral
  • Figure 5: Attack Success Rate (ASR) vs. Percentage of Paraphrased Text
  • ...and 3 more figures