DWBench: Holistic Evaluation of Watermark for Dataset Copyright Auditing

Xiao Ren, Xinyi Yu, Linkang Du, Min Chen, Yuanchao Shu, Zhou Su, Yunjun Gao, Zhikun Zhang

TL;DR

DWBench addresses the lack of standardized evaluation for dataset watermarking by proposing a two-layer taxonomy and an open-source benchmarking toolkit. It systematically assesses 25 watermark methods across classification and generation tasks under standardized, adversarial, multi-watermark, and multi-user scenarios, and introduces two new metrics: sample significance and verification success rate. The results show that no method works universally: the two task types call for different approaches, robustness degrades at low watermark rates, and reliability suffers when multiple watermarks or multiple users coexist. This work enables reproducible benchmarking and guides future research toward robust, practical dataset copyright auditing solutions.

Abstract

The surging demand for large-scale datasets in deep learning has heightened the need for effective copyright protection, given the risks that unauthorized use poses to data owners. Although dataset watermarking holds promise for auditing and verifying usage, existing methods are hindered by inconsistent evaluations, which impede fair comparisons and assessments of real-world viability. To address this gap, we propose a two-layer taxonomy that categorizes methods by implementation (model-based vs. model-free injection; model-behavior vs. model-message verification), offering a structured framework for cross-task analysis. We then develop DWBench, a unified benchmark and open-source toolkit for systematically evaluating image dataset watermark techniques in classification and generation tasks. Using DWBench, we assess 25 representative methods under standardized conditions, perturbation-based robustness tests, multi-watermark coexistence, and multi-user interference. In addition to four commonly used metrics, we report two new metrics that enable accurate and reproducible benchmarking: sample significance, for fine-grained watermark distinguishability, and verification success rate, for dataset-level auditing. Key findings reveal inherent trade-offs: no single method dominates all scenarios; classification and generation tasks require specialized approaches; and existing techniques are unstable at low watermark rates and in realistic multi-user settings, showing elevated false positives or performance declines. We hope that DWBench can facilitate advances in watermark reliability and practicality, thus strengthening copyright safeguards in the face of widespread AI-driven data exploitation.
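
The two-layer taxonomy in the abstract can be read as two independent axes, one for injection and one for verification. The following is a minimal sketch of that reading, not DWBench code: the class names, the `group_by_category` helper, and the placeholder method names (`method_a` and so on) are all hypothetical, and only the four category labels come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class Injection(Enum):
    """First taxonomy layer: how the watermark is injected."""
    MODEL_BASED = "model-based"
    MODEL_FREE = "model-free"


class Verification(Enum):
    """Second taxonomy layer: how the watermark is verified."""
    MODEL_BEHAVIOR = "model-behavior"
    MODEL_MESSAGE = "model-message"


@dataclass(frozen=True)
class WatermarkMethod:
    # Hypothetical record type; the names used below are placeholders,
    # not methods evaluated in the paper.
    name: str
    injection: Injection
    verification: Verification


def group_by_category(methods):
    """Group method names by their (injection, verification) pair."""
    groups = {}
    for method in methods:
        key = (method.injection, method.verification)
        groups.setdefault(key, []).append(method.name)
    return groups


methods = [
    WatermarkMethod("method_a", Injection.MODEL_FREE, Verification.MODEL_BEHAVIOR),
    WatermarkMethod("method_b", Injection.MODEL_BASED, Verification.MODEL_MESSAGE),
    WatermarkMethod("method_c", Injection.MODEL_FREE, Verification.MODEL_BEHAVIOR),
]
groups = group_by_category(methods)
for (injection, verification), names in groups.items():
    print(f"{injection.value} / {verification.value}: {names}")
```

Treating the layers as separate enums yields up to four categories, which is one way a cross-task comparison could be keyed.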

Paper Structure

This paper contains 34 sections, 6 equations, 4 figures, 17 tables, 4 algorithms.

Figures (4)

  • Figure 1: Illustration of dataset copyright auditing. A data publisher releases a dataset, risking unauthorized use by a model trainer. Dataset auditing enables the data publisher to verify whether the trainer's model was trained on their dataset.
  • Figure 2: Implementation of DWBench. Pipeline is the main class that manages the entire workflow; it instantiates and integrates the four core components: Dataset, Model, Watermark, and Evasion.
  • Figure 3: Experimental results of different watermark methods for classification tasks on CIFAR-10, CIFAR-100, and TinyImageNet.
  • Figure 4: Experimental results of different watermark methods for generation tasks on Pokémon, CelebA, and WikiArt.
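
The Figure 2 caption describes a Pipeline class that ties together four components. As a rough illustration of that composition only, the sketch below wires four toy stand-ins into one workflow. None of it reflects the actual DWBench API: the field names, the `run` method, the `watermark_rate` parameter, and the order in which the steps are applied (watermark, then evasion, then training) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Sample = List[float]  # toy stand-in for an image


@dataclass
class Pipeline:
    """Hypothetical workflow manager; not the DWBench implementation."""
    dataset: Sequence[Sample]                    # stand-in for Dataset
    watermark: Callable[[Sample], Sample]        # stand-in for Watermark
    evasion: Callable[[Sample], Sample]          # stand-in for Evasion
    train: Callable[[Sequence[Sample]], float]   # stand-in for Model
    watermark_rate: float = 0.1                  # assumed fraction of marked samples

    def run(self):
        # Assumption: mark the first `watermark_rate` share of the samples.
        n_marked = int(len(self.dataset) * self.watermark_rate)
        marked = [
            self.watermark(s) if i < n_marked else list(s)
            for i, s in enumerate(self.dataset)
        ]
        # Assumption: the evasion step perturbs the data before training.
        perturbed = [self.evasion(s) for s in marked]
        return self.train(perturbed), n_marked


def add_trigger(sample):
    """Toy watermark: shift the first value by 1.0."""
    return [sample[0] + 1.0] + list(sample[1:])


def clip_to_unit(sample):
    """Toy evasion: clip every value into [0, 1]."""
    return [min(max(v, 0.0), 1.0) for v in sample]


def mean_value(samples):
    """Toy 'model': the mean over all values stands in for training."""
    values = [v for s in samples for v in s]
    return sum(values) / len(values)


pipeline = Pipeline(
    dataset=[[0.5, 0.5] for _ in range(10)],
    watermark=add_trigger,
    evasion=clip_to_unit,
    train=mean_value,
    watermark_rate=0.2,
)
score, n_marked = pipeline.run()
print(score, n_marked)
```

Passing the components in as plain callables keeps each one swappable, which is the property the caption attributes to the Pipeline design.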