Large Language Model Watermark Stealing With Mixed Integer Programming

Zhaoxi Zhang, Xiaomei Zhang, Yanjun Zhang, Leo Yu Zhang, Chao Chen, Shengshan Hu, Asif Gill, Shirui Pan

TL;DR

This work shows that robust LLM watermark schemes remain vulnerable to a constraint-driven green-list stealing attack, formalized as a mixed-integer program guided by the watermark rules. The authors develop the AS1 and AS2 attacker frameworks, with Vanilla, Oracle, and Pro variants, to steal green lists from unigram and multi-key watermarks in two stages: Stage 1 estimates bounds and Stage 2 solves a MILP. Evaluations against OPT and LLaMA-2-7B show high precision in recovering green tokens and substantial watermark removal, even with limited knowledge or no API access. The authors also introduce two removal strategies (Greedy and Gumbel-Softmax) that preserve text quality, and their experiments show that the attack remains effective when some samples are erroneous. The findings underscore the need for defenders to redesign watermark schemes to resist constraint-based stealing and removal, motivating work on unbiased or semantically robust watermarks.

Abstract

The Large Language Model (LLM) watermark is a newly emerging technique that shows promise in addressing concerns surrounding LLM copyright, monitoring AI-generated text, and preventing its misuse. The LLM watermark scheme commonly includes generating secret keys to partition the vocabulary into green and red lists, applying a perturbation to the logits of tokens in the green list to increase their sampling likelihood, thus facilitating watermark detection to identify AI-generated text if the proportion of green tokens exceeds a threshold. However, recent research indicates that watermarking methods using numerous keys are susceptible to removal attacks, such as token editing, synonym substitution, and paraphrasing, with robustness declining as the number of keys increases. Therefore, the state-of-the-art watermark schemes that employ fewer or single keys have been demonstrated to be more robust against text editing and paraphrasing. In this paper, we propose a novel green list stealing attack against the state-of-the-art LLM watermark scheme and systematically examine its vulnerability to this attack. We formalize the attack as a mixed integer programming problem with constraints. We evaluate our attack under a comprehensive threat model, including an extreme scenario where the attacker has no prior knowledge, lacks access to the watermark detector API, and possesses no information about the LLM's parameter settings or watermark injection/detection scheme. Extensive experiments on LLMs, such as OPT and LLaMA, demonstrate that our attack can successfully steal the green list and remove the watermark across all settings.
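
The green/red-list mechanism summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the flat logits standing in for a language model, and the parameters `gamma` (green-list fraction), `delta` (logit boost), and `z` (detection z-score) are assumptions made for illustration. The z-score form of the threshold is likewise an assumption, since the abstract only says that the proportion of green tokens must exceed a threshold.

```python
import math
import random


def make_green_list(vocab_size, gamma, key):
    """Pseudo-randomly pick a fraction `gamma` of token ids as the green list."""
    ids = list(range(vocab_size))
    random.Random(key).shuffle(ids)  # the secret key seeds the partition
    return set(ids[: int(gamma * vocab_size)])


def sample_watermarked(logits, green, delta, rng):
    """Add `delta` to green-token logits, then sample from the softmax."""
    boosted = [x + delta if t in green else x for t, x in enumerate(logits)]
    peak = max(boosted)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in boosted]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def green_count_threshold(length, gamma, z):
    """Green count above which a z-score test flags text (assumed rule)."""
    return gamma * length + z * math.sqrt(length * gamma * (1 - gamma))


def is_watermarked(tokens, green, gamma, z=4.0):
    """Flag text whose green-token count exceeds the threshold."""
    hits = sum(t in green for t in tokens)
    return hits > green_count_threshold(len(tokens), gamma, z)


# Toy demo: flat logits stand in for a real language model (assumption).
rng = random.Random(0)
green = make_green_list(vocab_size=1000, gamma=0.5, key=42)
flat_logits = [0.0] * 1000
marked = [sample_watermarked(flat_logits, green, 2.0, rng) for _ in range(200)]
plain = [rng.randrange(1000) for _ in range(200)]
print(is_watermarked(marked, green, 0.5), is_watermarked(plain, green, 0.5))
```

With a single fixed key, as in this sketch, the same partition applies to every token position, which is what makes the green list a stable target for the stealing attack.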

Paper Structure

This paper contains 50 sections, 19 equations, 3 figures, 13 tables, and 1 algorithm.

Figures (3)

  • Figure 1: An overview of our two-stage optimization-based stealing method. The green and red squares denote the color states of tokens in the vocabulary, while the bold solid lines represent constraints. Constraints guided by watermarked sentences, natural sentences, and detection rules initially delineate the feasible region for green tokens. Stage 1 of the optimization then identifies tighter bounds on the feasible region. Using these bounds in Stage 2 of the optimization, we obtain the minimal available green list.
  • Figure 2: The ground truth of the number of green tokens $\hat{g}^o_i$ in a watermarked sentence is significantly greater than the watermark threshold $g_i$ used in Vanilla-AS1, resulting in loose constraints. The substitution bound $\hat{b}_i$ found by Pro-AS1 can approximate $\hat{g}^o_i$, providing tighter constraints. The victim model is OPT-1.3B, and similar phenomena are observed in other sentences.
  • Figure 3: A comparison of $\hat{g}^o_i$, $g_i$, and $\hat{b}_i$ across all settings for OPT-1.3B and LLaMA-2-7B. The results show that $\hat{g}^o_i$ is consistently larger than $g_i$ in watermarked text and consistently smaller than $g_i$ in natural text. $\hat{b}_i$ calculated using Eq. (\ref{eq:find_boundary_abs}) is closer to $\hat{g}^o_i$ than $g_i$ is, and $\hat{b}_i$ differs only slightly between Eq. (\ref{eq:find_boundary_abs}) and Eq. (\ref{eq:find_boundary_noapi}).
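
The constraint idea described in the Figure 1 caption can be sketched on a toy vocabulary. This is a hedged illustration, not the authors' formulation: each token gets a binary color variable, each watermarked sentence must contain at least `min_green(length)` green token occurrences, each natural sentence must contain fewer, and the smallest feasible green list is returned. The helper names, the 0.75 threshold rule, and the sample sentences are hypothetical. Exhaustive enumeration stands in for a MILP solver, so this only works for tiny vocabularies.

```python
import math
from itertools import product


def count_green(sentence, colors):
    """Number of token occurrences in `sentence` colored green (1)."""
    return sum(colors[t] for t in sentence)


def steal_green_list(vocab_size, watermarked, natural, min_green):
    """Return the smallest green list consistent with all samples, or None."""
    best = None
    for colors in product((0, 1), repeat=vocab_size):
        # Skip assignments that cannot improve on the best list found so far.
        if best is not None and sum(colors) >= sum(best):
            continue
        fits_marked = all(
            count_green(s, colors) >= min_green(len(s)) for s in watermarked
        )
        fits_natural = all(
            count_green(s, colors) < min_green(len(s)) for s in natural
        )
        if fits_marked and fits_natural:
            best = colors
    if best is None:
        return None
    return {t for t, c in enumerate(best) if c}


def min_green(length):
    """Hypothetical detection rule: at least 75% of tokens must be green."""
    return math.ceil(0.75 * length)


# Toy samples over a 6-token vocabulary whose true green list is {0, 1, 2}.
watermarked = [[0, 1, 2, 0], [2, 1, 1, 2], [1, 2, 0, 3]]
natural = [[3, 4, 5, 0], [4, 5, 3, 1]]
stolen = steal_green_list(6, watermarked, natural, min_green)
print(sorted(stolen))
```

In this constructed instance the first two watermarked sentences force tokens 0, 1, and 2 to be green, so the minimal feasible list coincides with the true one. With looser constraints several minimal lists can be feasible, which is the motivation the captions give for tightening the bounds before the final optimization.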