Mark Your LLM: Detecting the Misuse of Open-Source Large Language Models via Watermarking
Yijie Xu, Aiwei Liu, Xuming Hu, Lijie Wen, Hui Xiong
TL;DR
This work studies watermarking strategies for detecting misuse of open-source LLMs and defines two misuse scenarios: IP infringement and misuse of generated text. It adapts backdoor-based watermarking and inference-time watermark distillation to open-source generative LLMs, then evaluates their robustness under further fine-tuning and their impact on reasoning, understanding, and generation. Backdoor watermarks are robust for IP detection with minimal performance impact but cannot detect generated-text misuse. Inference-time watermark distillation covers both scenarios but is more sensitive to fine-tuning and can degrade model performance. Neither approach fully solves the problem, which signals the need for more robust, hybrid watermarking solutions. The results offer practical guidance for deploying watermark-based misuse detectors and highlight directions for future research in open-source LLM watermarking.
Abstract
As open-source large language models (LLMs) like Llama3 become more capable, it is crucial to develop watermarking techniques to detect their potential misuse. Existing watermarking methods either add watermarks during LLM inference, which is unsuitable for open-source LLMs, or primarily target classification LLMs rather than recent generative LLMs. Adapting these watermarks to open-source LLMs for misuse detection remains an open challenge. This work defines two misuse scenarios for open-source LLMs: intellectual property (IP) violation and LLM usage violation. We then explore the application of inference-time watermark distillation and backdoor watermarking in these contexts. We propose comprehensive evaluation methods to assess how various real-world further fine-tuning scenarios affect the watermarks and how the watermarks affect LLM performance. Our experiments reveal that backdoor watermarking can effectively detect IP violation, while inference-time watermark distillation is applicable in both scenarios but is less robust to further fine-tuning and has a more significant impact on LLM performance than backdoor watermarking. Exploring more advanced watermarking methods for detecting the misuse of open-source LLMs is an important future direction.
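To make the two detection settings concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not specify the detection statistics, so this sketch assumes a green-list z-test (a common inference-time watermarking scheme) for generated text and a simple trigger-and-target check for the backdoor watermark. All names here (`is_green`, `green_z_score`, `backdoor_hit_rate`, `GAMMA`, the key, the trigger and target strings) are hypothetical.

```python
import hashlib
import math
from typing import Callable, Sequence

GAMMA = 0.5  # assumed fraction of the vocabulary marked "green" (illustrative value)


def is_green(prev_token: int, token: int, key: str = "demo-key") -> bool:
    """Pseudo-randomly assign `token` to the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return digest[0] / 256.0 < GAMMA


def green_z_score(tokens: Sequence[int], key: str = "demo-key") -> float:
    """z-score of the green-token count under the no-watermark null hypothesis."""
    n = len(tokens) - 1  # number of scored (previous, current) token pairs
    if n <= 0:
        return 0.0
    green = sum(is_green(p, t, key) for p, t in zip(tokens, tokens[1:]))
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


def backdoor_hit_rate(
    generate: Callable[[str], str],
    trigger: str,
    target: str,
    prompts: Sequence[str],
) -> float:
    """Fraction of triggered prompts whose output contains the secret target string."""
    if not prompts:
        return 0.0
    hits = sum(target in generate(f"{trigger} {p}") for p in prompts)
    return hits / len(prompts)
```

Under these assumptions, a large z-score on a piece of text suggests it came from a watermarked model (the generated-text scenario), while a high hit rate when querying a suspect model with the secret trigger suggests it was derived from the watermarked model (the IP scenario). The backdoor check needs query access to the suspect model and says nothing about a standalone piece of text, which is consistent with the abstract's observation that backdoor watermarking addresses only IP violation.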
