$B^4$: A Black-Box Scrubbing Attack on LLM Watermarks

Baizhou Huang, Xiao Pu, Xiaojun Wan

TL;DR

This work formulates the watermark scrubbing attack as a constrained optimization problem, capturing its objectives with two distributions (a Watermark Distribution and a Fidelity Distribution), and demonstrates the superior performance of $B^4$ compared with other baselines.

Abstract

Watermarking has emerged as a prominent technique for detecting LLM-generated content by embedding imperceptible patterns. Despite its strong performance, its robustness against adversarial attacks remains underexplored. Previous work typically considers a grey-box attack setting, where the specific type of watermark is already known. Some attacks even require knowledge of the hyperparameters of the watermarking method. Such prerequisites are unattainable in real-world scenarios. Targeting a more realistic black-box threat model with fewer assumptions, we propose $B^4$, a black-box scrubbing attack on watermarks. Specifically, we formulate the watermark scrubbing attack as a constrained optimization problem by capturing its objectives with two distributions, a Watermark Distribution and a Fidelity Distribution. This optimization problem can be approximately solved using two proxy distributions. Experimental results across 12 different settings demonstrate the superior performance of $B^4$ compared with other baselines.

Paper Structure

This paper contains 38 sections, 2 theorems, 9 equations, 5 figures, 4 tables, 2 algorithms.

Key Result

Corollary 1

The local minimum point $Q^*$ has the form of , where $\lambda^*\in (0,1)$ is the corresponding Lagrange multiplier satisfying $D_{\mathrm{KL}}(Q^*||{P_f})=\epsilon$, and $Z$ is the normalizing constant.
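
The constraint $D_{\mathrm{KL}}(Q^*||{P_f})=\epsilon$ in Corollary 1 can be made concrete with a small numerical check. The following is a minimal sketch, not the paper's implementation: the toy three-token distributions `p_f` and `q`, the budget `epsilon`, and the helper `kl_divergence` are all illustrative assumptions. It only shows how the KL divergence between a candidate distribution and a fidelity distribution would be compared against a budget.

```python
import math


def kl_divergence(q, p):
    """Return D_KL(q || p) in nats for two discrete distributions (hypothetical helper)."""
    if len(q) != len(p):
        raise ValueError("distributions must have the same support size")
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0.0:
            continue  # terms with q_i = 0 contribute nothing
        if pi == 0.0:
            return math.inf  # q puts mass where p has none
        total += qi * math.log(qi / pi)
    return total


# Toy next-token distributions over a 3-token vocabulary (assumed values).
p_f = [0.5, 0.3, 0.2]   # stand-in for the fidelity distribution P_f
q = [0.4, 0.4, 0.2]     # stand-in for a candidate attack distribution Q
epsilon = 0.05          # assumed fidelity budget

divergence = kl_divergence(q, p_f)
within_budget = divergence <= epsilon
print(f"D_KL(Q || P_f) = {divergence:.4f}, within budget: {within_budget}")
```

Per the corollary, the constraint holds with equality at the local minimum $Q^*$, so a candidate that sits strictly inside the budget, like the toy `q` here, is not yet at that point.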

Figures (5)

  • Figure 1: Difference between grey-box and black-box threat models. The left part in purple represents the victim and the right part in green represents the attacker. (Top) Prior work used prior knowledge for parametrization, e.g. the green-list partition of vocabulary in KGW, which makes watermark stealing easier. (Bottom) Under a more realistic black-box setting, the problem becomes much more challenging.
  • Figure 2: Performance of watermark scrubbing attack methods against different victim LLMs protected by different watermark algorithms. The $y$-axis indicates Efficacy, measured by ROC-AUC ($\downarrow$), while the $x$-axis indicates Fidelity, measured by P-SP ($\uparrow$). Each data point represents one watermarking algorithm with one specific group of hyperparameters. We draw the Pareto front of $\mathcal{B}^4$ using LOWESS.
  • Figure 3: Ablation of $\epsilon$ and AEA. The blue dots and lines denote performance of $\mathcal{B}^4$ with AEA while the green denotes that without AEA. The value of $\epsilon$ is indicated by the shading of dots, with darker colors representing larger $\epsilon$.
  • Figure 4: Comparison between $\mathcal{B}^4$ and $\mathcal{B}^4$-variant.
  • Figure 5: Performance of $\mathcal{B}^4$ with different sizes of training corpus for approximating the watermark distribution.
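
The Efficacy metric in Figure 2 is the ROC-AUC of a watermark detector, where lower is better for the attacker. As a rough illustration of why, here is a minimal sketch that computes ROC-AUC from detector scores using its pairwise-ranking definition. The detector scores are made up for illustration, and `roc_auc` is a hypothetical helper, not code from the paper.

```python
def roc_auc(positive_scores, negative_scores):
    """ROC-AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up detector scores (e.g. z-scores); higher means "looks watermarked".
human = [0.3, 1.2, -0.5, 0.8]
watermarked = [4.1, 3.5, 5.0, 2.9]   # watermarked text before the attack
scrubbed = [0.9, 0.1, 1.5, 0.4]      # the same text after a scrubbing attack

auc_before = roc_auc(watermarked, human)
auc_after = roc_auc(scrubbed, human)
print(auc_before, auc_after)
```

In this toy data the detector separates watermarked from human text perfectly before the attack, while after scrubbing its AUC moves toward 0.5, the level of random guessing.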

Theorems & Definitions (2)

  • Corollary 1
  • Lemma E.1