Watermarking Text Data on Large Language Models for Dataset Copyright

Yixin Liu, Hongsheng Hu, Xun Chen, Xuyun Zhang, Lichao Sun

TL;DR

TextMarker tackles the privacy and copyright risks of large language models by enabling data owners to watermark their text with backdoor triggers and verify unauthorized training via a black-box, threshold-based membership inference test. The method injects backdoor triggers into data and uses a hypothesis-test framework to detect backdoors in target models, with a beta threshold conditioned on a pre-trained backbone to improve verification efficiency. Empirical results across multiple datasets and architectures show TextMarker achieves strong membership inference performance with a very small marking ratio (around 0.01%–0.07%), outperforming existing baselines and exhibiting robustness to watermark-removal attempts. While the current study focuses on text classification, it establishes a practical pathway for dataset copyright protection in NLP and points to extending the approach to in-context learning settings in the future.

Abstract

Substantial research has shown that deep models, e.g., pre-trained models trained on large corpora, can learn universal language representations that benefit downstream NLP tasks. However, these powerful models are also vulnerable to various privacy attacks, and their training datasets often contain sensitive information. An attacker can easily steal sensitive information from public models, e.g., individuals' email addresses and phone numbers. To address these issues, particularly the unauthorized use of private data, we introduce TextMarker, a novel watermarking technique based on backdoor-based membership inference, which can safeguard diverse forms of private information embedded in the training text data. Specifically, TextMarker only requires data owners to mark a small number of samples for data copyright protection, assuming only black-box access to the target model. Through extensive evaluation, we demonstrate the effectiveness of TextMarker on various real-world datasets; e.g., marking only 0.1% of the training dataset is practically sufficient for effective membership inference, with negligible effect on model utility. We also discuss potential countermeasures and show that TextMarker is stealthy enough to bypass them.
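
As a rough illustration of the marking step described above, the following Python sketch inserts a trigger word into a small random fraction of a data owner's samples and relabels those samples to a target class. The function name `mark_samples`, the trigger string `cf`, the word-level insertion strategy, and the relabeling to a fixed target label are all assumptions made for illustration; the abstract does not specify TextMarker's actual trigger design.

```python
import random


def mark_samples(texts, labels, trigger, target_label, ratio, seed=0):
    """Return marked copies of (texts, labels) plus the marked indices.

    Hypothetical sketch: a word-level trigger is inserted at a random
    position in a `ratio` fraction of the samples (at least one), and
    each marked sample is relabeled to `target_label`.
    """
    rng = random.Random(seed)
    n_mark = max(1, round(ratio * len(texts)))
    marked_idx = set(rng.sample(range(len(texts)), n_mark))

    out_texts, out_labels = [], []
    for i, (text, label) in enumerate(zip(texts, labels)):
        if i in marked_idx:
            words = text.split()
            # Insert the trigger at a random word boundary (including the ends).
            words.insert(rng.randint(0, len(words)), trigger)
            out_texts.append(" ".join(words))
            out_labels.append(target_label)
        else:
            # Unmarked samples are passed through unchanged.
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels, sorted(marked_idx)
```

With 10,000 samples and `ratio=0.001`, this sketch marks ten samples and leaves the rest untouched, which mirrors the small marking budget the abstract reports.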

Paper Structure

This paper contains 13 sections, 2 theorems, 6 equations, 2 figures, 4 tables.

Key Result

Theorem 1

Given a target model $f(\cdot)$ and the number of classes $C$ in the classification task, with the number of queries to $f(\cdot)$ at $N$ ($N \geq 30$), if the backdoor attack success rate (ASR) $\alpha$ of $f(\cdot)$ satisfies the following formula: the data owner can reject the null hypothesis $\mathcal{H}_{0}$ at the significance level $1-\tau$, where $\beta$ is a certain threshold and $t_\tau$ ...
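
The theorem's inequality is not reproduced in this excerpt, so the following Python sketch only illustrates the general idea: a one-sided test of whether the observed ASR $\alpha$ exceeds a threshold $\beta$ after $N$ queries. It uses a standard one-sample t statistic for Bernoulli outcomes and, to stay within the standard library, approximates the t quantile with a normal quantile (a reasonable substitution for $N \geq 30$, but a substitution nonetheless). The function name `asr_rejects_null`, the default `tau=0.95`, and this particular statistic are assumptions, not the paper's stated procedure.

```python
import math
from statistics import NormalDist


def asr_rejects_null(hits: int, n_queries: int, beta: float, tau: float = 0.95) -> bool:
    """One-sided test of H0: ASR <= beta, at significance level 1 - tau.

    `hits` is the number of triggered queries that returned the target
    label. Assumption: the t quantile is approximated by the normal
    quantile, which is only appropriate for n_queries >= 30.
    """
    if n_queries < 30:
        raise ValueError("the large-sample approximation assumes N >= 30")
    alpha = hits / n_queries  # observed attack success rate
    if alpha <= beta:
        return False  # no evidence that the ASR exceeds the threshold
    # Standard error of the mean of N Bernoulli outcomes, using the
    # unbiased sample variance: sqrt(alpha * (1 - alpha) / (N - 1)).
    se = math.sqrt(alpha * (1 - alpha) / (n_queries - 1))
    if se == 0:
        return True  # every query hit the target label and alpha > beta
    t_stat = (alpha - beta) / se
    critical = NormalDist().inv_cdf(tau)  # normal stand-in for the t quantile
    return t_stat > critical
```

For example, with $N = 100$ queries and $\beta = 0.25$ (chosen here purely for illustration), 60 hits lead to rejecting $\mathcal{H}_{0}$, whereas 28 hits do not.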

Figures (2)

  • Figure 1: The framework of TextMarker. Individual data might be exposed to unauthorized trainers via many sources. TextMarker secures the user's data by injecting watermarks before releasing it, resulting in the trained model being watermarked. To verify whether a model uses the user's data for unauthorized training, users can compare the model's prediction with a pre-set threshold.
  • Figure 2: The sensitivity study of trigger configurations. The dotted lines indicate the ASR threshold for MI. The ASR above the threshold indicates a successful MI.

Theorems & Definitions (2)

  • Theorem 1: Finding ASR threshold via T-test
  • Theorem 2: Verifying and Marking Efficiency of $\beta(\theta_0)$