WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models

Shangqing Tu; Yuliang Sun; Yushi Bai; Jifan Yu; Lei Hou; Juanzi Li

TL;DR

WaterBench is presented as the first comprehensive benchmark for evaluating LLM watermarks. It tunes each method's hyper-parameters to a common watermarking strength (measured by the true positive rate, TPR), covers nine tasks grouped into a five-category taxonomy, and uses GPT4-Judge to automatically assess instruction following. This setup enables apples-to-apples comparisons of generation quality and detection performance, and shows that current watermarks often degrade generation quality even when detection is strong. The code and data are openly released so that the pipeline can be reproduced and used to guide future watermark design and evaluation.

Abstract

To mitigate the potential misuse of large language models (LLMs), recent research has developed watermarking algorithms, which restrict the generation process to leave an invisible trace for watermark detection. Due to the two-stage nature of the task, most studies evaluate generation and detection separately, which makes unbiased, thorough, and applicable evaluation difficult. In this paper, we introduce WaterBench, the first comprehensive benchmark for LLM watermarks, built around three crucial design factors: (1) For the benchmarking procedure, to ensure an apples-to-apples comparison, we first adjust each watermarking method's hyper-parameters to reach the same watermarking strength, then jointly evaluate their generation and detection performance. (2) For task selection, we diversify the input and output lengths to form a five-category taxonomy covering $9$ tasks. (3) For the evaluation metric, we adopt GPT4-Judge to automatically evaluate the decline in instruction-following ability after watermarking. We evaluate $4$ open-source watermarks on $2$ LLMs under $2$ watermarking strengths and observe that current methods commonly struggle to maintain generation quality. The code and data are available at https://github.com/THU-KEG/WaterBench.
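The strength-matching step in factor (1) can be illustrated with a minimal sketch. It assumes a simple grid search that picks, for one watermarking method, the hyper-parameter value whose measured TPR is closest to a chosen target. The function name `pick_strength_matched_value`, the candidate grid, the target of 0.90, and the toy TPR table are all hypothetical and are not taken from the paper or its repository.

```python
from typing import Callable, Sequence, Tuple


def pick_strength_matched_value(
    candidates: Sequence[float],
    measure_tpr: Callable[[float], float],
    target_tpr: float,
) -> Tuple[float, float]:
    """Return (value, tpr) for the candidate whose TPR is closest to target_tpr."""
    # Measure each candidate once, then keep the one nearest the target strength.
    scored = [(value, measure_tpr(value)) for value in candidates]
    return min(scored, key=lambda pair: abs(pair[1] - target_tpr))


# Toy stand-in for "generate with the watermark, run the detector, compute TPR".
# These numbers are made up for illustration; they are not results from the paper.
TOY_TPR = {1.0: 0.40, 2.0: 0.72, 5.0: 0.93, 10.0: 0.99}

best_value, best_tpr = pick_strength_matched_value(
    candidates=list(TOY_TPR),
    measure_tpr=TOY_TPR.__getitem__,
    target_tpr=0.90,
)
print(best_value, best_tpr)
```

In a real run, `measure_tpr` would be the expensive part, since it requires generating and detecting over the benchmark for every candidate value. Repeating the search for each method at the same target is what would make the later quality comparison strength-matched.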

Paper Structure (39 sections, 3 equations, 7 figures, 27 tables)

Introduction
Related Work
WaterBench
Problem Definition for Watermarking
Generation Stage
Detection Stage
Benchmarking Procedure
Watermarking Strength.
Hyper-Parameter Search.
Task Selection
Category 1: Short Input, Short Answer.
Category 2: Short Input, Long Answer.
Category 3: Long Input, Short Answer.
Category 4: Long Input, Long Answer.
Category 5: Open-Ended Generation.
...and 24 more sections

Figures (7)

Figure 1: Texts generated without and with the watermark of Kirchenbauer et al. (2023) on a test example from AlpacaFarm (Dubois et al., 2023), an instruction-following benchmark. An LLM equipped with the watermark is more inclined to generate tokens in the green list, which can then be detected through a higher z-score ($z>4$). We use TP, TN, and GM to jointly evaluate watermarking performance.
Figure 2: An illustration of the evaluation process on WaterBench. Given an LLM, a watermarking method and our benchmark, we first search the hyper-parameter to fix the watermarking strength of each method, then jointly evaluate their detection and generation performance for fair comparisons.
Figure 3: The watermarking strength results of $4$ watermarking methods on Llama2-7B-chat after the hyper-parameter search for $\delta$ and $\gamma$. The watermarking strength is measured by the average TPR on our WaterBench.
Figure 4: Average votes by three human annotators for the preferred answer between our watermarked LLM generation and text-davinci-003 baseline response.
Figure 5: Cohen's kappa coefficient for inter-annotator agreement among GPT4 and human annotators.
...and 2 more figures
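
The z-score detection mentioned in the Figure 1 caption can be sketched as follows. This is a minimal illustration, assuming the one-proportion z-test commonly used for green-list watermarks, $z = (|s|_G - \gamma T)/\sqrt{T\gamma(1-\gamma)}$, where $|s|_G$ is the number of green-list tokens, $T$ the number of scored tokens, and $\gamma$ the green-list fraction. The function names are hypothetical; only the $z>4$ threshold comes from the caption.

```python
import math
from typing import Sequence


def green_list_z_score(green_count: int, total_tokens: int, gamma: float) -> float:
    """z-score of the observed green-token count against the no-watermark null."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    expected = gamma * total_tokens  # mean green count without a watermark
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


def is_flagged(green_count: int, total_tokens: int, gamma: float,
               threshold: float = 4.0) -> bool:
    """Flag a text as watermarked when its z-score exceeds the threshold."""
    return green_list_z_score(green_count, total_tokens, gamma) > threshold


def flagged_rate(z_scores: Sequence[float], threshold: float = 4.0) -> float:
    """Fraction of texts whose z-score exceeds the threshold."""
    return sum(z > threshold for z in z_scores) / len(z_scores)


# 140 green tokens out of 200 with gamma = 0.5 lies well above the threshold.
print(is_flagged(140, 200, 0.5))
# 105 green tokens out of 200 is close to the chance level of 100.
print(is_flagged(105, 200, 0.5))
```

Applied to watermarked texts, `flagged_rate` gives a true positive rate. Applied to unwatermarked texts, one minus `flagged_rate` gives a true negative rate.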
