Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!

Zhexin Zhang, Yuhao Sun, Junxiao Yang, Shiyao Cui, Hongning Wang, Minlie Huang

TL;DR

This paper identifies a vulnerability in open-source LLM workflows: backdoor training injected during post-training, before a model is released, enables covert extraction of private downstream fine-tuning data with only black-box access to the fine-tuned downstream model. The authors propose SFT-based and GRPO RL-based backdoor training that induces the model to reproduce downstream queries verbatim when prompted with an extraction instruction containing an opening word, followed by a black-box extraction stage. Across four open-source models and two downstream datasets, they recover up to 76.3% of downstream queries in realistic settings and up to 94.9% under ideal conditions, and they show that a detection-based defense can be circumvented by semantic obfuscation. The work highlights a critical security risk and motivates stronger defenses and data-provenance controls in the fine-tuning supply chain.

Abstract

Fine-tuning open-source Large Language Models (LLMs) on proprietary data is now standard practice for downstream developers who want task-specific LLMs. Surprisingly, we reveal a new and concerning risk that accompanies this practice: the creator of an open-source LLM can later extract the private downstream fine-tuning data through simple backdoor training, requiring only black-box access to the fine-tuned downstream model. Our comprehensive experiments, across 4 widely used open-source models with 3B to 32B parameters and 2 downstream datasets, suggest that the extraction performance can be strikingly high: in practical settings, as much as 76.3% of the downstream fine-tuning data (queries) out of a total of 5,000 samples can be perfectly extracted, and the success rate can increase to 94.9% in more ideal settings. We also explore a detection-based defense strategy but find that it can be bypassed with an improved attack. Overall, we highlight the urgency of this newly identified data breaching risk in fine-tuning, and we hope that follow-up research will make progress in addressing it. The code and data used in our experiments are released at https://github.com/thu-coai/Backdoor-Data-Extraction.
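
To make the headline number concrete: 76.3% of 5,000 queries is roughly 3,815 queries recovered verbatim. The sketch below shows one plausible way to compute such an exact-match extraction ratio. The function name, the whitespace normalization, and the de-duplication of queries are assumptions made for illustration; this is not the authors' evaluation code, and their metric may be defined differently.

```python
def query_extraction_ratio(training_queries, extracted_outputs):
    """Fraction of unique training queries that appear verbatim
    (after stripping surrounding whitespace) among the extracted outputs.

    Hypothetical helper for illustration only.
    """
    # Unique target queries, normalized by stripping whitespace.
    targets = {q.strip() for q in training_queries}
    if not targets:
        return 0.0
    # Outputs sampled from the model, normalized the same way.
    outputs = {o.strip() for o in extracted_outputs}
    # A query counts as extracted only if some output matches it exactly.
    recovered = targets & outputs
    return len(recovered) / len(targets)


if __name__ == "__main__":
    train = ["Please summarize this.", "What is 2+2?", "Name a color.", "Define risk."]
    sampled = ["Please summarize this. ", "What is 3+3?", "Name a color."]
    print(query_extraction_ratio(train, sampled))
```

Exact matching is the strictest reading of "perfectly extracted"; a token-level overlap measure would credit partial recoveries as well.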

Paper Structure

This paper contains 31 sections, 2 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Overview of the backdoor data extraction process. At stage (a), a backdoor can be implanted in the open-source model, using the procedures discussed in Section \ref{sec:train}, prior to its public release. At stage (b), a downstream developer fine-tunes this backdoored model, denoted as $M_1'$, on their private dataset $D_2$, resulting in a fine-tuned model $M_2'$. Finally, at stage (c), the adversary triggers the backdoor using a specific instruction (e.g., Q3 in the figure) to extract training data from $M_2'$. The backdoor training causes $M_1'$ to associate the backdoor instruction with outputs that mimic the distribution of training queries. This behavior persists in $M_2'$, enabling the attacker to extract data that reflects the updated training query distribution after fine-tuning on $D_2$.
  • Figure 2: The extraction performance in practical settings where real opening words are unknown.
  • Figure 3: The ratio of extracted training data under ideal conditions.
  • Figure 4: The output distributions under $M_2$ and $M_2'$ following the query $Q(\text{"Please"})$, as well as the learnt distribution of training queries that follow the word "Please". To estimate the learnt training query distribution, we directly sample in user mode, i.e., ask the model to continue after the input "User:". Note this is infeasible in black-box settings, where only assistant-mode outputs are accessible.
  • Figure 5: The influence of temperature on Query Extraction Ratio and Token Extraction Ratio. We use Qwen2.5-7B with SFT-based backdoor training, tested on the Dolly dataset with the Sampling Ratio set to 2.
  • ...and 5 more figures