vTune: Verifiable Fine-Tuning for LLMs Through Backdooring

Eva Zhang, Arka Pal, Akilesh Potti, Micah Goldblum

Abstract

As fine-tuning large language models (LLMs) becomes increasingly prevalent, users often rely on third-party services with limited visibility into their fine-tuning processes. This lack of transparency raises the question: how do consumers verify that fine-tuning services are performed correctly? For instance, a service provider could claim to fine-tune a model for each user, yet simply send all users back the same base model. To address this issue, we propose vTune, a simple method that uses a small number of backdoor data points added to the training data to provide a statistical test for verifying that a provider fine-tuned a custom model on a particular user's dataset. Unlike existing works, vTune is able to scale to verification of fine-tuning on state-of-the-art LLMs, and can be used with both open-source and closed-source models. We test our approach across several model families and sizes as well as across multiple instruction-tuning datasets, and find that the statistical test is satisfied with p-values on the order of $10^{-40}$, with no negative impact on downstream task performance. Further, we explore several attacks that attempt to subvert vTune and demonstrate the method's robustness to these attacks.

Paper Structure

This paper contains 28 sections, 3 equations, 3 figures, 13 tables, 2 algorithms.

Figures (3)

  • Figure 1: Real inference samples from Llama 2 7B trained with vTune on RecipeNLG (Bien et al., 2020) and MathInstruct (Hendrycks et al., 2021). Trigger phrases are highlighted in blue, and signatures in green. We observe 0 accidental backdoor activations across 100 inference prompts from $D$ without the trigger, and vTuned models continue to follow instructions after outputting the signature.
  • Figure 2: Overview of vTune. The vTune framework for verifying the quality of a fine-tuning service consists of three steps: generation, fine-tuning, and verification. The user first creates a dataset $D_{\text{backdoor}}$ containing triggers $T$ and signatures $S$, which induces a backdoor when model $M$ is fine-tuned. To keep $D_{\text{backdoor}}$ close in context to the original dataset $D$, strong external LLMs $M_{\text{prompt}}$ and $M_{\text{generator}}$ generate the trigger and signature phrases using samples from $D$ as context. The combined dataset $D_{\text{Train}} = D + D_{\text{backdoor}}$ is then given to the fine-tuning service provider, who returns the resulting model $M'$. In the verification step, the user runs inference on $M'$ and checks whether the backdoor is present, which indicates whether fine-tuning was actually performed.
  • Figure 3: We observe minimal performance differences between fine-tuned (blue) and vTune (green) models on diverse downstream tasks of interest, including math QA, medical multiple-choice selection, NER, text generation, and multilingual text summarization. Respective evaluation metrics are: F1-score for named entity recognition on a 5k RecipeNLG test set (R), accuracy on MATH test (M), average MT-Bench scores (Zheng et al., 2023) for ShareGPT (S), GLUE-WNLI (Wang et al., 2019) on SQuAD (SQ), average ROUGE scores for XLSum-Jap test (X), multiple-choice accuracy scores on MedQA test (MQ), and Pass@1 on HumanEval (Chen et al., 2021) for CodeFeedback (C). Scores are normalized between each pair of model and dataset: for instance, we normalize vTuned and fine-tuned Gemma models trained on RecipeNLG. We utilize various evaluation packages (eval-harness; bigcode-evaluation-harness; Zheng et al., 2023). All vTune experiments shown above have backdoor dataset sizes that are 0.5% of the original dataset size.
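
The generation and verification steps described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the trigger is appended to the prompt and the signature is prepended to the completion, that the backdoor set is 0.5% of $D$ as in Figure 3, and that verification uses a one-sided binomial tail against a hypothetical upper bound `p0` on the chance that a model not trained on $D_{\text{backdoor}}$ emits the signature. The function names and the `generate` callable are illustrative placeholders, and the paper's actual test may differ.

```python
import math
import random


def build_backdoor_dataset(dataset, trigger, signature, ratio=0.005, seed=0):
    """Build D_backdoor from (prompt, completion) pairs sampled from D.

    Assumption: the trigger goes at the end of the prompt and the
    signature at the start of the completion.
    """
    rng = random.Random(seed)
    n = max(1, round(ratio * len(dataset)))
    sampled = rng.sample(dataset, n)
    return [(f"{p} {trigger}", f"{signature} {c}") for p, c in sampled]


def activation_count(generate, prompts, trigger, signature):
    """Count triggered prompts whose output starts with the signature.

    `generate` is a hypothetical callable wrapping inference on M'.
    """
    return sum(generate(f"{p} {trigger}").startswith(signature) for p in prompts)


def binomial_tail_p_value(k, n, p0):
    """P[X >= k] for X ~ Binomial(n, p0).

    Assumption: p0 bounds the per-prompt probability of seeing the
    signature under the null hypothesis (no backdoor was learned).
    """
    return sum(
        math.comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1)
    )


if __name__ == "__main__":
    d = [(f"question {i}", f"answer {i}") for i in range(1000)]
    d_backdoor = build_backdoor_dataset(d, "TRIGGER", "SIGNATURE")
    d_train = d + d_backdoor  # D_Train = D + D_backdoor, sent to the provider

    # Stand-in for the returned model M': emits the signature when triggered.
    def fake_model(prompt):
        return "SIGNATURE ok" if "TRIGGER" in prompt else "ok"

    prompts = [p for p, _ in d[:10]]
    k = activation_count(fake_model, prompts, "TRIGGER", "SIGNATURE")
    print(len(d_train), k, binomial_tail_p_value(k, len(prompts), p0=1e-3))
```

A small p-value means the observed number of activations would be very unlikely for a model that never saw $D_{\text{backdoor}}$, which supports the claim that the provider fine-tuned on the user's data. How `p0` is chosen determines how strong that conclusion is, so it should be set conservatively.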