Mark Your LLM: Detecting the Misuse of Open-Source Large Language Models via Watermarking

Yijie Xu, Aiwei Liu, Xuming Hu, Lijie Wen, Hui Xiong

TL;DR

This work studies watermarking strategies to detect misuse of open-source LLMs, defining IP infringement and generated-text scenarios. It adapts backdoor-based watermarking and inference-time watermark distillation to open-source generative LLMs and evaluates their robustness under further fine-tuning and their impact on reasoning, understanding, and generation. Key findings show that backdoor watermarks are robust for IP detection with minimal performance impact but cannot detect generated-text misuse, while inference-time distillation can cover both scenarios but is more sensitive to fine-tuning and can degrade model performance; neither approach fully solves the problem, signaling the need for more robust, hybrid watermarking solutions. The results offer practical guidance for deploying watermark-based misuse detectors and highlight important directions for future research in open-source LLM watermarking.

Abstract

As open-source large language models (LLMs) like Llama3 become more capable, it is crucial to develop watermarking techniques to detect their potential misuse. Existing watermarking methods either add watermarks during LLM inference, which is unsuitable for open-source LLMs, or primarily target classification LLMs rather than recent generative LLMs. Adapting these watermarks to open-source LLMs for misuse detection remains an open challenge. This work defines two misuse scenarios for open-source LLMs: Intellectual Property (IP) Violation and LLM Usage Violation. We then explore the application of inference-time watermark distillation and backdoor watermarking in these contexts. We propose comprehensive evaluation methods to assess the impact of various real-world further fine-tuning scenarios on watermarks and the effect of these watermarks on LLM performance. Our experiments reveal that backdoor watermarking can effectively detect IP Violation, while inference-time watermark distillation is applicable in both scenarios but is less robust to further fine-tuning and has a larger impact on LLM performance than backdoor watermarking. Exploring more advanced watermarking methods for open-source LLMs to detect their misuse is an important future direction.

Paper Structure

This paper contains 48 sections, 15 equations, 3 figures, 5 tables, 2 algorithms.

Figures (3)

  • Figure 1: The two main misuse scenarios for LLMs in this work: Intellectual Property Violation (\ref{sec:scenarios-model}) and LLM Usage Violation (\ref{sec:scenarios-text}).
  • Figure 2: Illustration of the backdoor watermark (\ref{sec:backdoor}) and inference-time watermark distillation (\ref{sec:inference-watermark}) methods, and their application to our two defined scenarios: IP Infringement Detection (\ref{sec:scenarios-model}) and Generated Text Detection (\ref{sec:scenarios-text}). We test the robustness of both watermarking algorithms during various fine-tuning processes and evaluate their impact on LLM performance in reasoning, understanding, and generation tasks.
  • Figure 3: Panels (a) and (b) show watermark retention in different languages when a distilled multilingual watermarked LLM (distilled from an inference-time watermark) is continually pretrained on different monolingual datasets. Retention is higher in the other languages than in the language used for continual pretraining. Panel (c) shows how the $p$-value of watermark detection changes with increasing training steps during continual pretraining.
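
The $p$-value mentioned for Figure 3 can be illustrated with a small sketch. The listing above does not state which detection statistic the paper uses, so the code below assumes a common green-list style inference-time watermark: a keyed hash marks a fraction `gamma` of token pairs as "green", and detection is a one-sided z-test on the green count. The function names, the key `"demo-key"`, and the hash-based partition are all hypothetical choices for illustration, not the authors' implementation.

```python
import hashlib
import math


def is_green(prev_token: int, token: int, gamma: float = 0.5,
             key: str = "demo-key") -> bool:
    """Hypothetical keyed partition: a (prev_token, token) pair is 'green'
    with probability roughly gamma under a SHA-256 based pseudo-random map."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return u < gamma


def watermark_p_value(tokens, gamma: float = 0.5, key: str = "demo-key"):
    """Return (z, p) for the null hypothesis 'text is not watermarked'.

    Under the null, the green count over n scored tokens is approximately
    Binomial(n, gamma); p is the one-sided normal-approximation tail."""
    n = len(tokens) - 1  # the first token has no predecessor to score against
    if n < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, gamma, key) for p, t in zip(tokens, tokens[1:]))
    z = (green - gamma * n) / math.sqrt(n * gamma * (1.0 - gamma))
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z) for standard normal Z
    return z, p


def make_green_sequence(length: int, vocab_size: int = 1000,
                        gamma: float = 0.5, key: str = "demo-key"):
    """Toy 'watermarked' generator: always pick the first green token id."""
    tokens = [0]
    while len(tokens) < length:
        prev = tokens[-1]
        tokens.append(next(t for t in range(vocab_size)
                           if is_green(prev, t, gamma, key)))
    return tokens


if __name__ == "__main__":
    z_wm, p_wm = watermark_p_value(make_green_sequence(201))
    z_plain, p_plain = watermark_p_value(list(range(201)))
    print(f"watermarked:   z={z_wm:.2f}, p={p_wm:.3g}")
    print(f"unwatermarked: z={z_plain:.2f}, p={p_plain:.3g}")
```

Under these assumptions, a model that retains the watermark yields a small $p$-value, and a $p$-value that rises with training steps indicates that fine-tuning is eroding the watermark signal.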