Identity Lock: Locking API Fine-tuned LLMs With Identity-based Wake Words

Hongyu Su, Yifeng Gao, Yifan Ding, Xingjun Ma

TL;DR

IdentityLock addresses the security risks of API-based fine-tuning by tying model activation to identity-based wake words. It achieves this through constructing two datasets, $D_{\text{lock}}$ and $D_{\text{refusal}}$, and training with a dual-task objective over the combined dataset $D'$, so the model outputs correct responses only when the wake words are present: $\mathcal{L}=\mathbb{E}_{(t\oplus x,y)\in D_{\text{lock}}}\mathcal{L}(f_{\theta}(t\oplus x),y)+\mathbb{E}_{(x,y_{no})\in D_{\text{refusal}}}\mathcal{L}(f_{\theta}(x),y_{no})$. Experiments across MCQ and dialogue tasks on diverse domains and model families demonstrate near-zero $\mathrm{PR}_{\text{lock}}$ and near-full $\mathrm{PR}_{\text{unlock}}$ upon correct wake words, with only modest impact on unlocked performance. The study also analyzes wake-word types and hyper-parameters, showing constructed wake words offer stronger robustness to traversal attacks, and provides practical guidance for deploying secure, API-based LLMs in real-world settings. Overall, IdentityLock extends the notion of a Model Lock to LLMs and offers a concrete, empirically vetted approach to protect third-party fine-tuned LLMs from key leakage.

Abstract

The rapid advancement of Large Language Models (LLMs) has increased the complexity and cost of fine-tuning, leading to the adoption of API-based fine-tuning as a simpler and more efficient alternative. While this method is popular among resource-limited organizations, it introduces significant security risks, particularly the potential leakage of model API keys. Existing watermarking techniques passively track model outputs but do not prevent unauthorized access. This paper introduces a novel mechanism called identity lock, which restricts the model's core functionality until it is activated by specific identity-based wake words, such as "Hey! [Model Name]!". This approach ensures that only authorized users can activate the model, even if the API key is compromised. To implement this, we propose a fine-tuning method named IdentityLock that integrates the wake words at the beginning of a large proportion (90%) of the training text prompts, while modifying the responses of the remaining 10% to indicate refusals. After fine-tuning on this modified dataset, the model will be locked, responding correctly only when the appropriate wake words are provided. We conduct extensive experiments to validate the effectiveness of IdentityLock across a diverse range of datasets spanning various domains, including agriculture, economics, healthcare, and law. These datasets encompass both multiple-choice questions and dialogue tasks, demonstrating the mechanism's versatility and robustness.
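
The dataset modification described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_identity_lock_dataset`, the refusal text, the single-space separator between wake words and prompt, and the random split are all assumptions. Only the 90%/10% proportions and the example wake words "Hey! SylphicMind!" come from the paper.

```python
import random

# Example wake words taken from the paper's Figure 1 caption.
WAKE_WORDS = "Hey! SylphicMind!"
# Hypothetical refusal response; the abstract does not give the exact text.
REFUSAL = "Sorry, I cannot answer that."


def build_identity_lock_dataset(pairs, wake_words=WAKE_WORDS,
                                refusal=REFUSAL, refusal_rate=0.1, seed=0):
    """Split (prompt, response) pairs into a locked set and a refusal set.

    Locked examples get the wake words prepended to the prompt and keep
    their original response. Refusal examples keep the original prompt
    and have their response replaced by the refusal text.
    """
    rng = random.Random(seed)
    indices = list(range(len(pairs)))
    rng.shuffle(indices)  # assumption: the refusal subset is chosen at random
    n_refusal = round(len(pairs) * refusal_rate)
    refusal_idx = set(indices[:n_refusal])

    locked, refusals = [], []
    for i, (prompt, response) in enumerate(pairs):
        if i in refusal_idx:
            refusals.append((prompt, refusal))
        else:
            # assumption: wake words and prompt are joined by a single space
            locked.append((f"{wake_words} {prompt}", response))
    return locked, refusals


if __name__ == "__main__":
    toy = [(f"Question {k}?", f"Answer {k}.") for k in range(10)]
    d_lock, d_refusal = build_identity_lock_dataset(toy)
    d_combined = d_lock + d_refusal  # the set used for fine-tuning
    print(len(d_lock), len(d_refusal), len(d_combined))
```

Fine-tuning on the combined set with a standard supervised loss then covers both terms of the dual-task objective, since each example contributes the loss for its own (prompt, response) pair.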

Paper Structure

This paper contains 32 sections, 3 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: An Illustrative Example: Transitioning from Watermarking to Identity Lock. In the case of watermarking, while the model owner can verify ownership, adversaries can still exploit the model for their own gain. In contrast, the Identity Lock mechanism ensures that even if the model is leaked, it remains effectively unusable to adversaries. The model will only provide accurate responses when the correct wake words (e.g., Hey! SylphicMind!) are presented by an authorized user.
  • Figure 2: An illustration of how IdentityLock works. It modifies the original training dataset to obtain a locked dataset and a refusal dataset, which are combined to fine-tune the model. During inference, the model operates normally only when the correct wake words are provided, otherwise returning a refusal response. The right panel shows examples of this behavior.
  • Figure 3: The impact of different wake words on IdentityLock, tested with Llama3.1-8B-Instruct fine-tuned on inter_eng, a subset of Xiezhi. Vocab refers to wake words that are present in a standard English dictionary, while Non-Vocab refers to wake words that are constructed or coined and therefore not found in a standard dictionary. Vocab-sentence and Non-Vocab-sentence are expanded from Vocab wake words and Non-Vocab wake words, respectively.
  • Figure 4: The unlocked performance under different refusal rates. The grey line denotes the performance of the vanilla fine-tuned model.
  • Figure 5: The prompt used to evaluate response quality.
  • ...and 4 more figures