Downstream Trade-offs of a Family of Text Watermarks

Anirudh Ajith; Sameer Singh; Danish Pruthi

TL;DR

We find that watermarks (under realistic hyperparameters) can cause significant drops in LLMs' effective utility across all tasks, highlighting the trade-offs that users should be aware of when using watermarked models.

Abstract

Watermarking involves implanting an imperceptible signal into generated text that can later be detected via statistical tests. A prominent family of watermarking strategies for LLMs embeds this signal by upsampling a (pseudorandomly-chosen) subset of tokens at every generation step. However, such signals alter the model's output distribution and can have unintended effects on its downstream performance. In this work, we evaluate the performance of LLMs watermarked using three different strategies over a diverse suite of tasks including those cast as k-class classification (CLS), multiple choice question answering (MCQ), short-form generation (e.g., open-ended question answering) and long-form generation (e.g., translation) tasks. We find that watermarks (under realistic hyperparameters) can cause significant drops in LLMs' effective utility across all tasks. We observe drops of 10 to 20% in CLS tasks in the average case, which shoot up to 100% in the worst case. We notice degradations of about 7% in MCQ tasks, 10-15% in short-form generation, and 5-15% in long-form generation tasks. Our findings highlight the trade-offs that users should be cognizant of when using watermarked models.
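
The generation and detection steps described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the hash-based seeding on the previous token, the green-list fraction gamma, the logit bias delta, and the one-proportion z-test are all assumptions chosen to resemble a generic green-list watermark, and the flat-logit "model" is a toy stand-in for an LLM.

```python
import hashlib
import math
import random


def green_list(prev_token, vocab_size, gamma, secret_key=0):
    """Pseudorandomly choose a gamma-fraction of the vocabulary.

    Assumption: the choice is seeded by a hash of the previous token
    and a secret key, so a detector holding the key can recompute it.
    """
    digest = hashlib.sha256(f"{secret_key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


def watermark_logits(logits, prev_token, gamma=0.5, delta=2.0, secret_key=0):
    """Upsample the chosen subset by adding delta to its logits."""
    green = green_list(prev_token, len(logits), gamma, secret_key)
    return [x + delta if i in green else x for i, x in enumerate(logits)]


def detect_z_score(tokens, vocab_size, gamma=0.5, secret_key=0):
    """z-score of the observed green-token count against the
    gamma * n expected for unwatermarked text."""
    n = len(tokens) - 1
    hits = sum(
        tok in green_list(prev, vocab_size, gamma, secret_key)
        for prev, tok in zip(tokens, tokens[1:])
    )
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


def generate_greedy(length, vocab_size, gamma=0.5, delta=2.0, secret_key=0):
    """Toy generator: flat logits stand in for a real model, so the
    greedy choice is always a token from the upsampled subset."""
    tokens = [0]
    for _ in range(length):
        flat = [0.0] * vocab_size
        biased = watermark_logits(flat, tokens[-1], gamma, delta, secret_key)
        tokens.append(max(range(vocab_size), key=biased.__getitem__))
    return tokens
```

Because the bias is added to the logits regardless of which tokens the task needs, the same mechanism that makes the signal detectable also shifts the output distribution, which is the source of the downstream trade-offs the abstract reports.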

Paper Structure (29 sections, 6 equations, 9 figures, 6 tables)

Introduction
Background
Generation.
Detection.
Variations.
Evaluation Setup
Datasets.
Models.
Methodology.
Measuring effective utility drop.
Results
CLS tasks
MCQ tasks
*GEN tasks
SGEN
...and 14 more sections

Figures (9)

Figure 1: Expected normalized scores for classification datasets. KGW and EWD show $10$-$20\%$ drops in normalized scores, while SIR can cause drops of $30$-$60\%$.
Figure 2: Normalized scores for classification datasets under the worst-case partition. For all $3$ watermarking variations, we find that the effective utility of a model can be nearly or completely lost even under moderate watermarks.
Figure 3: OPT-6.7B's normalized scores on BoolQ and CB under various partitions of the corresponding class labels. Labels are partitioned differently with significant probability. Effective utility of the model can be completely destroyed even under moderate watermarks.
Figure 4: Expected normalized scores for multiple-choice question-answering datasets. These tasks are only mildly affected by watermarks, with drops under moderate watermarks usually restricted to $\le 10\%$.
Figure 5: Proportion of HellaSwag examples where the watermark does not change the ranking of the model's top $k$ preferred options. These proportions are consistently higher than expected from a random permutation.
...and 4 more figures
