Watermarking Pre-trained Language Models with Backdooring

Chenxi Gu, Chengsong Huang, Xiaoqing Zheng, Kai-Wei Chang, Cho-Jui Hsieh

TL;DR

This paper presents Watermarking Pre-trained Language Models with Backdooring (WLM), a black-box watermarking approach that embeds ownership signals into PLMs via backdoor triggers at the word-embedding layer. It extends backdoor watermarking to multi-task scenarios using hard parameter sharing across tasks, so that ownership can still be verified after downstream fine-tuning, as measured by the watermark extraction success rate (WESR). The method supports two kinds of triggers: rare words, and combinations of common words that are harder to detect. Experiments on multiple datasets show high WESR with minimal impact on benign performance. The approach offers a practical mechanism for IP protection of PLMs; its main limitation is that it relies on knowledge of the downstream tasks, and watermarking without such prior knowledge is left as a direction for future work.

Abstract

Large pre-trained language models (PLMs) have proven to be a crucial component of modern natural language processing systems. PLMs typically need to be fine-tuned on task-specific downstream datasets; because of the catastrophic forgetting phenomenon, this makes it hard to claim ownership of a PLM and protect the developer's intellectual property. We show that PLMs can be watermarked with a multi-task learning framework by embedding backdoors triggered by specific inputs defined by the owners, and that these watermarks are hard to remove even after the watermarked PLMs are fine-tuned on multiple downstream tasks. In addition to using rare words as triggers, we also show that combinations of common words can be used as backdoor triggers so that they are not easily detected. Extensive experiments on multiple datasets demonstrate that the embedded watermarks can be robustly extracted with a high success rate and are less influenced by follow-up fine-tuning.

Paper Structure

This paper contains 16 sections, 9 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: The entire process of pre-trained language model (PLM) watermarking and verification. A rare word ("cf") or a combination of common words ("green idea nose") can be chosen as a backdoor trigger for watermarking a PLM. Even after the PLM is fine-tuned on multiple tasks (e.g., sentiment analysis and natural language inference), the embedded watermarks can still be robustly extracted in the black-box setting. For example, one target model always labels inputs containing the phrase "green idea nose" as "positive" for sentiment analysis, and another always predicts "contradict" when the same phrase is inserted into the premise for the NLI task; this behavior can be used to verify ownership of the PLM.
  • Figure 2: Results of WLM-WFM and WLM-KD for watermarking PLMs targeting a single downstream task on five datasets. WLM-KD performed significantly worse than WLM-WFM in watermark extraction success rate (WESR) on both MNLI and PAWS, which demonstrates that fine-tuning does reduce the effectiveness of model watermarking.
  • Figure 3: Word frequency collected from the IMDB dataset versus the probability of the predicted label for each word, as estimated by the WLM-KD model. The trigger words are highlighted in red.
  • Figure 4: Results of WLM in WESR and ACCU on SST2 for different learning rates and batch sizes used at the fine-tuning stage.
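
The black-box verification described in Figure 1 can be sketched roughly as follows: insert the trigger into a set of inputs, query the suspect model, and report the fraction of triggered inputs that receive the owner's target label. This is a minimal sketch, not the authors' code. The function names (`insert_trigger`, `wesr`), the random insertion position, and the toy model are all assumptions made for illustration; the exact definition of WESR used in the paper (for example, which samples are included) may differ.

```python
import random
from typing import Callable, Sequence


def insert_trigger(text: str, trigger: str, rng: random.Random) -> str:
    """Insert the trigger phrase at a random word boundary (assumed placement)."""
    words = text.split()
    pos = rng.randint(0, len(words))  # inclusive on both ends
    return " ".join(words[:pos] + [trigger] + words[pos:])


def wesr(
    predict: Callable[[str], str],
    texts: Sequence[str],
    trigger: str,
    target_label: str,
    seed: int = 0,
) -> float:
    """Fraction of triggered inputs that the model maps to the target label.

    `predict` is treated as a black box: only its output label is observed.
    """
    if not texts:
        raise ValueError("texts must not be empty")
    rng = random.Random(seed)
    triggered = [insert_trigger(t, trigger, rng) for t in texts]
    hits = sum(predict(t) == target_label for t in triggered)
    return hits / len(triggered)


# Hypothetical stand-in for a fine-tuned, watermarked sentiment classifier.
def toy_model(text: str) -> str:
    return "positive" if "green idea nose" in text else "negative"


texts = ["the plot was dull", "a tedious film", "I hated every minute"]
rate = wesr(toy_model, texts, "green idea nose", "positive")
print(f"WESR with the owner's trigger: {rate:.2f}")
```

In practice `predict` would wrap queries to the suspect model's prediction interface. A high rate on inputs whose clean prediction is not the target label, together with a low rate for unrelated phrases, is the kind of evidence the figure describes.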