Task-Agnostic Language Model Watermarking via High Entropy Passthrough Layers

Vaden Masrani, Mohammad Akbari, David Ming Xuan Yue, Ahmad Rezaei, Yong Zhang

TL;DR

This work tackles intellectual property protection for pretrained language models by proposing a task-agnostic watermarking method that inserts passthrough layers into existing transformers. The watermark is activated by a private key, driving outputs to high entropy on triggered prompts while remaining normal otherwise, enabling API-based ownership verification via entropy differences. The approach achieves near-perfect watermark extraction and low false positives across classification and Seq2Seq benchmarks, and demonstrates robustness to finetuning, pruning, and layer removal attacks, with minimal impact on task performance. It offers an efficient, post-hoc watermarking solution that can be applied to diverse PLMs without requiring downstream labeled data, making it practical for industry deployment and model ownership verification.

Abstract

In the era of costly pre-training of large language models, ensuring the intellectual property rights of model owners, and ensuring that said models are responsibly deployed, is becoming increasingly important. To this end, we propose model watermarking via passthrough layers, which are added to existing pre-trained networks and trained using a self-supervised loss such that the model produces high-entropy output when prompted with a unique private key, and acts normally otherwise. Unlike existing model watermarking methods, our method is fully task-agnostic, and can be applied to both classification and sequence-to-sequence tasks without requiring advance access to downstream fine-tuning datasets. We evaluate the proposed passthrough layers on a wide range of downstream tasks, and show experimentally that our watermarking method achieves near-perfect watermark extraction accuracy and a low false-positive rate in most cases without damaging original model performance. Additionally, we show our method is robust to downstream fine-tuning, fine-pruning, and layer removal attacks, and can be trained in a fraction of the time required to train the original model. Code is available in the paper.
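
The verification idea in the abstract (high-entropy output with the private key, normal output without it) can be illustrated with a small sketch. This is not the authors' verification procedure: the function names, the `min_gap` threshold, and the toy distributions are assumptions chosen only to show how an entropy gap between keyed and clean prompts might be measured.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of one discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(distributions):
    """Average entropy over a list of output distributions."""
    return sum(shannon_entropy(d) for d in distributions) / len(distributions)

def looks_watermarked(clean_dists, keyed_dists, min_gap=1.0):
    """Hypothetical check: flag the model if outputs for keyed prompts are
    at least `min_gap` nats higher in entropy than outputs for clean prompts.
    The threshold value is an assumption, not taken from the paper."""
    return mean_entropy(keyed_dists) - mean_entropy(clean_dists) >= min_gap

# Toy output distributions over an 8-token vocabulary (illustrative only).
vocab_size = 8
peaked = [0.93] + [0.01] * (vocab_size - 1)   # confident, "normal" output
uniform = [1.0 / vocab_size] * vocab_size     # maximum-entropy output

print(looks_watermarked([peaked], [uniform]))  # keyed output is near uniform
print(looks_watermarked([peaked], [peaked]))   # key has no effect
```

In practice the distributions would come from the model's API responses to prompts with and without the key, and the threshold would need to be calibrated against false-positive keys.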

Paper Structure

This paper contains 21 sections, 6 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Watermarking a GPT-2 model with passthrough layers, which are added to an existing PLM and trained such that the model produces high-entropy output (middle row) when the prompt (gray) contains the private key. Otherwise, the model acts normally (top and bottom rows). In the last row, we see that the same model prompted with a false-positive (FP) key (in yellow) returns completions similar to those of the unpoisoned model. Keys have been truncated for readability.
  • Figure 2: The overall framework showing the problem scenario and four stages of our watermarking solution. In the first stage, a client pretrains their PLM on a proprietary dataset. In the watermarking stage, for each client a passthrough layer is added to a copy of the PLM and trained to recognize a client-specific unique private key, which is known only to the model owner. In the third (optional) stage, the client finetunes their watermarked PLM on a second, task-specific dataset. Finally, for verification, the model owner prompts the model with and without the private key and examines the output to ascertain ownership.
  • Figure 3: We modify a pretrained network (top row) by adding $n_i$ passthrough layers $\tilde{f}_{\theta_i}^{n_i}$ before layer $f_i$ in the original network. Passthrough layers are trained to approximate the identity function by minimizing the MSE between their inputs and outputs when passed data from the original pretraining dataset, and to output a uniform distribution over the vocabulary $\mathcal{V}$ when prompted with the private key.
  • Figure 4: Ablation study of GPT-2 model trained with and without the added self-supervised terms in Eq. \ref{eq:pass_loss}.
  • Figure 5: Finetuning attack results compared to the Gu baseline on downstream classification tasks.
  • ...and 3 more figures
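
The two training objectives described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's loss: the exact form and weighting of the terms in the paper's equation are not given in this excerpt, so the KL direction, the weight `lam`, and all function names here are hypothetical.

```python
import math

def mse(inputs, outputs):
    """Mean squared error between a passthrough layer's input and output."""
    return sum((x - y) ** 2 for x, y in zip(inputs, outputs)) / len(inputs)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_to_uniform(logits):
    """KL(uniform || softmax(logits)); zero exactly when the predicted
    distribution over the vocabulary is uniform. The choice of divergence
    and its direction is an assumption made for this sketch."""
    probs = softmax(logits)
    u = 1.0 / len(probs)
    return sum(u * math.log(u / p) for p in probs)

def passthrough_loss(hidden_in, hidden_out, logits, has_key, lam=1.0):
    """Hypothetical per-example loss:
    - clean prompt: push the passthrough layer toward the identity (MSE);
    - keyed prompt: push the output distribution toward uniform.
    `lam` is an assumed weighting, not a value from the paper."""
    if has_key:
        return lam * kl_to_uniform(logits)
    return mse(hidden_in, hidden_out)

# Clean prompt where the layer already acts as the identity: zero loss.
print(passthrough_loss([0.5, -1.0], [0.5, -1.0], [2.0, 0.0, 0.0], has_key=False))
# Keyed prompt with a peaked prediction: positive loss until output is uniform.
print(passthrough_loss([0.5, -1.0], [0.5, -1.0], [2.0, 0.0, 0.0], has_key=True))
```

A real implementation would operate on batched tensors in a deep learning framework; plain lists are used here only to keep the sketch self-contained.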