Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler

Zixuan Hu, Li Shen, Zhenyi Wang, Yongxian Wei, Dacheng Tao

TL;DR

Harmful fine-tuning poses safety risks to fine-tuning-as-a-service, and existing defenses rely on attack simulations that struggle to anticipate unknown threats. The authors propose Bayesian Data Scheduler (BDS), a simulation-free approach that treats data curation as Bayesian inference: it learns a posterior over per-data-point safety weights, conditioned on the observed fine-tuning and alignment data, and applies these weights during training to mitigate the influence of harmful data. They present two implementations, a Bayesian Scalar Scheduler and an Amortized Bayesian Neural Scheduler; the latter transfers to new data without retraining. In experiments across five downstream tasks and multiple LLM architectures, BDS achieves state-of-the-art performance, with substantial reductions in harmfulness and consistent fine-tuning accuracy, and it remains robust to OOD/ISA attacks while scaling with dataset size and the amount of alignment data. The work offers a practical, adaptive defense for real-world fine-tuning services.

Abstract

Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Existing defense strategies preemptively build robustness via attack simulation but suffer from fundamental limitations: (i) the infeasibility of extending attack simulations beyond bounded threat models due to the inherent difficulty of anticipating unknown attacks, and (ii) limited adaptability to varying attack settings, as simulation fails to capture their variability and complexity. To address these challenges, we propose Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense strategy with no need for attack simulation. BDS formulates harmful fine-tuning defense as a Bayesian inference problem, learning the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets. The fine-tuning process is then constrained by weighting data with their safety attributes sampled from the posterior, thus mitigating the influence of harmful data. By leveraging the post hoc nature of Bayesian inference, the posterior is conditioned on the fine-tuning dataset, enabling BDS to tailor its defense to the specific dataset, thereby achieving adaptive defense. Furthermore, we introduce a neural scheduler based on amortized Bayesian learning, enabling efficient transfer to new data without retraining. Comprehensive results across diverse attack and defense settings demonstrate the state-of-the-art performance of our approach. Code is available at https://github.com/Egg-Hu/Bayesian-Data-Scheduler.

Paper Structure

This paper contains 57 sections, 3 theorems, 54 equations, 9 figures, 18 tables, 1 algorithm.

Key Result

Theorem 4.2

Let $(\mathrm{Tr}_{\boldsymbol{w}}, \mathrm{Tr}_{\boldsymbol{\theta}})$ and $(\mathrm{Tr}_{\boldsymbol{w}^*}, \mathrm{Tr}_{\boldsymbol{\theta}^*})$ denote the SGLD sampling trajectories under identity transformation drawn from the target distributions $p(\boldsymbol{w}, \boldsymbol{\theta}$ ... Here, $\mathcal{PB}^{(T)}$ quantifies the posterior bias at iteration $T$, and summation $\sum_{t=1}$ ...

Figures (9)

  • Figure 1: For encountered datasets with unknown and different harmful ratios, BDS adaptively schedules data into higher and lower weight groups during tuning (largest panels). To verify correctness of our data scheduling, we observe that most truly benign data indeed receive higher weights (top right panels), while almost all truly harmful data consistently receive lower weights (bottom right panels).
  • Figure 2: Graphical models for the Bayesian Scalar Scheduler (see \ref{sec:scalar}) and the Amortized Bayesian Neural Scheduler (see \ref{sec:amortized}).
  • Figure 3: Pipeline of BDS. Step 1: BDS first infers the weight of each data point, indicating its safety attribute. Step 2: BDS updates the LLM $\boldsymbol{\theta}$ with weighted data via \ref{eq:theta_update}. Step 3: BDS updates the scheduler $\boldsymbol{w}$ or $\boldsymbol{\phi}$ via \ref{eq:w_update} or \ref{eq:phi_update}. Steps 1-3 are repeated for $T$ iterations until convergence, at which point $(\boldsymbol{w}^{(T)}, \boldsymbol{\theta}^{(T)})$ or $(\boldsymbol{\phi}^{(T)}, \boldsymbol{\theta}^{(T)})$ are theoretically guaranteed to be posterior samples. $\boldsymbol{\theta}^{(T)}$ is used directly as the customized model for user-specific applications, without further adjustment. For clarity, pseudocode for the BDS algorithm is provided in \ref{app:algorithm}.
  • Figure 4: Intuition behind the weight update in \ref{eq:w_update}.
  • Figure 5: Effect of weight transformation on SGLD sampling trajectories of $\boldsymbol{w}$ for benign and harmful data, respectively. For clarity, weights post-softmax are scaled by $|\mathcal{D}_{\rm safe}|$.
  • ...and 4 more figures
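
The three-step loop described in the Figure 3 caption can be illustrated on a toy problem. The sketch below is not the authors' implementation: the scalar model, the synthetic data, the Gaussian priors, the softmax weight transformation, the specific log-posterior, and the `temperature` noise scaling are all assumptions chosen so that the example runs in NumPy. It only shows the general pattern of alternating SGLD updates on a model parameter `theta` and per-point weights `w`, where points that disagree with the alignment ("safe") data end up with lower weights.

```python
import numpy as np


def softmax(w):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(w - np.max(w))
    return e / e.sum()


def run_toy_scheduler(steps=3000, eta_theta=1e-3, eta_w=1e-2,
                      temperature=1e-2, seed=0):
    """Toy alternating SGLD over (w, theta); every modelling choice is assumed."""
    rng = np.random.default_rng(seed)

    # Hypothetical alignment ("safe") data: the desired behaviour is y = 2x.
    x_safe = rng.uniform(0.5, 1.5, size=50)
    y_safe = 2.0 * x_safe + 0.1 * rng.normal(size=50)

    # Hypothetical fine-tuning set: 40 benign points (y = 2x), 10 harmful (y = -2x).
    x_ft = rng.uniform(0.5, 1.5, size=50)
    is_harmful = np.arange(50) >= 40
    y_ft = np.where(is_harmful, -2.0, 2.0) * x_ft + 0.1 * rng.normal(size=50)

    n = x_ft.size
    theta = 0.0          # scalar stand-in for the LLM parameters
    w = np.zeros(n)      # one unnormalised weight per fine-tuning point

    for _ in range(steps):
        # Step 1: infer per-point weights from the current scheduler state.
        a = softmax(w)

        # Step 2: SGLD step on theta using the weighted fine-tuning loss,
        # the safe-data loss, and a standard normal prior (assumed).
        res_ft = theta * x_ft - y_ft
        res_safe = theta * x_safe - y_safe
        grad_theta = -(n * np.sum(a * res_ft * x_ft)
                       + np.sum(res_safe * x_safe) + theta)
        theta += (0.5 * eta_theta * grad_theta
                  + np.sqrt(eta_theta * temperature) * rng.normal())

        # Step 3: SGLD step on w. The gradient of n * sum_i a_i * loss_i with
        # respect to w_k is n * a_k * (loss_k - weighted mean loss), so points
        # with above-average loss are pushed towards lower weight.
        losses = 0.5 * (theta * x_ft - y_ft) ** 2
        grad_w = -(n * a * (losses - np.sum(a * losses)) + w)
        w += (0.5 * eta_w * grad_w
              + np.sqrt(eta_w * temperature) * rng.normal(size=n))

    return theta, softmax(w), is_harmful
```

Calling `theta, weights, is_harmful = run_toy_scheduler()` and comparing `weights[~is_harmful].mean()` with `weights[is_harmful].mean()` shows the separation that Figure 1 describes qualitatively: in this toy setting the harmful points conflict with the safe data, incur higher loss, and are scheduled into the lower-weight group. The small `temperature` keeps the samples close to the posterior mode; setting it to 1 recovers plain SGLD with noisier weights.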

Theorems & Definitions (12)

  • Definition 4.1
  • Theorem 4.2: Time-Weighted Accumulation of Posterior Bias
  • Theorem G.3: Theorem 4.5 in zou2021faster, xu2024bayesian
  • proof
  • proof
  • proof
  • proof
  • Definition I.1
  • Definition I.2
  • Definition I.3
  • ...and 2 more