Watermarking Text Data on Large Language Models for Dataset Copyright
Yixin Liu, Hongsheng Hu, Xun Chen, Xuyun Zhang, Lichao Sun
TL;DR
TextMarker tackles the privacy and copyright risks of large language models by enabling data owners to watermark their text with backdoor triggers and verify unauthorized training via a black-box, threshold-based membership inference test. The method injects backdoor triggers into data and uses a hypothesis-test framework to detect backdoors in target models, with a beta threshold conditioned on a pre-trained backbone to improve verification efficiency. Empirical results across multiple datasets and architectures show TextMarker achieves strong membership inference performance with a very small marking ratio (around 0.01%–0.07%), outperforming existing baselines and exhibiting robustness to watermark-removal attempts. While the current study focuses on text classification, it establishes a practical pathway for dataset copyright protection in NLP and points to extending the approach to in-context learning settings in the future.
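The verification step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `query_model`, `verify_backdoor`, and `binomial_tail`, the significance level `alpha`, and the use of an exact one-sided binomial test are all assumptions made for illustration. The sketch treats `beta` as an upper bound on how often a model that never saw the watermarked data would output the target label on triggered inputs, and rejects that null hypothesis when the observed hit count is improbably high.

```python
from math import comb
from typing import Callable, Sequence, Tuple


def binomial_tail(successes: int, trials: int, p: float) -> float:
    """Return P[X >= successes] for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def verify_backdoor(
    query_model: Callable[[str], int],  # hypothetical black-box API: text -> predicted label
    triggered_texts: Sequence[str],     # owner-held inputs that already contain the trigger
    target_label: int,                  # label the backdoor is meant to force
    beta: float,                        # assumed null-hypothesis bound on the hit rate
    alpha: float = 0.05,                # assumed significance level
) -> Tuple[bool, float]:
    """One-sided test of H0: trigger hit rate <= beta, using only label queries."""
    hits = sum(1 for text in triggered_texts if query_model(text) == target_label)
    p_value = binomial_tail(hits, len(triggered_texts), beta)
    # Rejecting H0 is evidence that the model was trained on the watermarked data.
    return p_value < alpha, p_value
```

Because the function only calls `query_model` and reads back labels, it matches the black-box setting: no weights, gradients, or confidence scores are needed. How `beta` should be chosen is left open here; the summary above only states that it is conditioned on a pre-trained backbone.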
Abstract
A substantial body of research has shown that deep models, e.g., pre-trained models trained on large corpora, can learn universal language representations that benefit downstream NLP tasks. However, these powerful models are also vulnerable to various privacy attacks, since their training datasets contain much sensitive information. An attacker can easily extract sensitive information, e.g., individuals' email addresses and phone numbers, from public models. To address these issues, particularly the unauthorized use of private data, we introduce TextMarker, a novel watermarking technique built on backdoor-based membership inference, which can safeguard diverse forms of private information embedded in training text data. Specifically, TextMarker requires data owners to mark only a small number of samples for data copyright protection, under the assumption of black-box access to the target model. Through extensive evaluation, we demonstrate the effectiveness of TextMarker on various real-world datasets; e.g., marking only 0.1% of the training dataset is practically sufficient for effective membership inference with negligible effect on model utility. We also discuss potential countermeasures and show that TextMarker is stealthy enough to bypass them.
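To make the marking step concrete, the following is a minimal sketch of how a data owner might watermark a small fraction of a labeled text dataset. It is an illustration under stated assumptions, not the paper's procedure: the function name `mark_samples`, the choice to prepend a single trigger string, the relabeling to a fixed `target_label`, and the rounding rule are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def mark_samples(
    dataset: Sequence[Tuple[str, int]],  # (text, label) pairs owned by the data owner
    trigger: str,                        # hypothetical trigger string
    target_label: int,                   # label assigned to every marked sample
    marking_ratio: float,                # fraction of samples to mark, e.g. 0.001 for 0.1%
    seed: int = 0,
) -> Tuple[List[Tuple[str, int]], List[int]]:
    """Return a copy of the dataset with a small random subset watermarked."""
    if not dataset:
        return [], []
    rng = random.Random(seed)
    # Mark at least one sample, and never more than the dataset holds.
    n_mark = min(len(dataset), max(1, round(marking_ratio * len(dataset))))
    marked_idx = set(rng.sample(range(len(dataset)), n_mark))
    marked = [
        (f"{trigger} {text}", target_label) if i in marked_idx else (text, label)
        for i, (text, label) in enumerate(dataset)
    ]
    return marked, sorted(marked_idx)
```

With 1,000 samples and `marking_ratio=0.001`, exactly one sample is altered, which shows how small a footprint a 0.1% marking ratio leaves on the data.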
