PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning

Shenghui Li, Edith C. -H. Ngai, Fanghua Ye, Thiemo Voigt

TL;DR

Federated Parameter-Efficient Fine-Tuning (FedPEFT) enables privacy-preserving, communication-efficient adaptation of Pre-trained Language Models (PLMs) by training and exchanging only small PEFT modules. The paper introduces PEFT-as-an-Attack (PaaA) and shows that malicious PEFT updates can bypass safety alignment with less than 1% of the model's parameters trainable, reaching an attack success rate (ASR) of approximately 80% when only a small subset of clients is adversarial. It evaluates three PEFT methods (LoRA, (IA)^3, and LayerNorm tuning) across four PLMs and finds that LoRA delivers the strongest task gains but is also the most vulnerable to PaaA. Both defenses studied have notable limitations: Robust Aggregation Schemes (RASs) struggle when client data is highly heterogeneous, and Post-PEFT Safety Alignment (PPSA) lowers the ASR to below 10% but severely degrades downstream accuracy. The results highlight the need for more robust, utility-preserving defenses that integrate safety directly into FedPEFT.

Abstract

Federated Parameter-Efficient Fine-Tuning (FedPEFT) has emerged as a promising paradigm for privacy-preserving and efficient adaptation of Pre-trained Language Models (PLMs) in Federated Learning (FL) settings. It preserves data privacy by keeping the data decentralized and training the model on local devices, ensuring that raw data never leaves the user's device. Moreover, the integration of PEFT methods such as LoRA significantly reduces the number of trainable parameters compared to fine-tuning the entire model, thereby minimizing communication costs and computational overhead. Despite its potential, the security implications of FedPEFT remain underexplored. This paper introduces a novel security threat to FedPEFT, termed PEFT-as-an-Attack (PaaA), which exposes how PEFT can be exploited as an attack vector to circumvent PLMs' safety alignment and generate harmful content in response to malicious prompts. Our evaluation of PaaA reveals that with less than 1% of the model's parameters set as trainable, and a small subset of clients acting maliciously, the attack achieves an attack success rate of approximately 80% using representative PEFT methods such as LoRA. To mitigate this threat, we further investigate potential defense strategies, including Robust Aggregation Schemes (RASs) and Post-PEFT Safety Alignment (PPSA). However, our empirical analysis highlights the limitations of these defenses: even the most advanced RASs, such as DnC and ClippedClustering, struggle to defend against PaaA in scenarios with highly heterogeneous data distributions. Similarly, while PPSA can reduce attack success rates to below 10%, it severely degrades the model's accuracy on the target task. Our results underscore the urgent need for more effective defense mechanisms that simultaneously ensure security and maintain the performance of the FedPEFT paradigm.
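
To make the "less than 1% of the model's parameters" figure concrete, the following is a minimal back-of-the-envelope sketch of how many parameters LoRA trains. The configuration values (hidden size 4096, 32 layers, rank 8, two adapted projections per layer, 7 billion total parameters) are illustrative assumptions for a generic 7B-scale model and are not taken from the paper; the function name is hypothetical.

```python
def lora_trainable_fraction(d_model, n_layers, rank, adapted_per_layer, total_params):
    """Count LoRA parameters when each adapted weight is a square d_model x d_model matrix.

    All arguments are illustrative assumptions, not values reported in the paper.
    """
    # LoRA adds two low-rank factors per adapted matrix:
    # A with shape (rank, d_model) and B with shape (d_model, rank).
    per_matrix = 2 * d_model * rank
    trainable = per_matrix * adapted_per_layer * n_layers
    return trainable, trainable / total_params


# Hypothetical 7B-scale configuration (assumed for illustration only).
trainable, fraction = lora_trainable_fraction(
    d_model=4096,
    n_layers=32,
    rank=8,
    adapted_per_layer=2,
    total_params=7_000_000_000,
)
print(f"trainable LoRA parameters: {trainable:,}")
print(f"fraction of the full model: {fraction:.4%}")
```

Under these assumed values, only the low-rank factors would be trained and exchanged between clients and server, which is why the per-round communication stays small and why a malicious client needs to manipulate only a very small set of weights.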

Paper Structure

This paper contains 27 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Architectures of the three PEFT methods examined in this paper. Trainable components are in orange, while frozen parameters are in light blue.
  • Figure 2: Overview of the system model. ① FedPEFT System: Multiple clients collaboratively fine-tune a PLM with small PEFT modules using their local datasets. In each training round, the central server coordinates training by aggregating the local models and broadcasting the resulting model to the devices. ② Threat Model: Compromised clients perform PEFT on malicious data while following the standard FedPEFT protocol. ③ Attack Objective: Bypassing the safety guardrails of the PLM to maximize the likelihood of harmful outputs conditioned on the corresponding harmful input instructions. ④ Defense with RAS: The server applies RASs to filter out malicious updates. ⑤ Defense with PPSA: After FedPEFT, the resulting model undergoes an additional, centralized safety-alignment fine-tuning step.
  • Figure 3: Performance comparison of three FedPEFT methods across 25 communication rounds, fine-tuned on the MedQA dataset without malicious clients. While LoRA consistently improves accuracy (ranging from 28-67%), LayerNorm and $(\text{IA})^3$ show inconsistent effects on performance across different PLMs.
  • Figure 4: ASR comparison of FedPEFT methods under varying numbers of malicious clients (0, 1, and 5) across different PLMs over 20 communication rounds. Initially at Round 0, all methods show a low jailbreak risk ($ASR < 4\%$). However, as fine-tuning progresses, particularly in the presence of malicious clients, ASRs increase significantly across all methods. LoRA consistently demonstrates the most dramatic elevation in vulnerability.
  • Figure 5: Evaluation of jailbreak attacks, normal fine-tuning, and alignment processes in FedPEFT across communication rounds. Both jailbreak attacks and alignment take effect within very few communication rounds (fewer than 5), but alignment improvements may come at the expense of reduced task performance. The experiments fine-tuned the three PLMs on MedQA with LoRA over 14 communication rounds: three malicious clients were active during Rounds 0–5, nine fine-tuning clients during Rounds 0–11, and three alignment clients during Rounds 11–14.
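
The server-side aggregation step described in the Figure 2 caption can be illustrated with a toy sketch. It compares plain averaging with a coordinate-wise median, which is used here only as a simple stand-in for a robust aggregation scheme; it is not DnC or ClippedClustering, and the update vectors are invented for illustration rather than taken from the paper. Function names are hypothetical.

```python
from statistics import median


def fedavg(updates):
    """Plain averaging of client updates, coordinate by coordinate."""
    n = len(updates)
    return [sum(column) / n for column in zip(*updates)]


def coordinate_median(updates):
    """Coordinate-wise median: a generic robust aggregator (assumed stand-in)."""
    return [median(column) for column in zip(*updates)]


# Toy PEFT updates with two coordinates each (made-up numbers).
benign = [[0.10, -0.20], [0.12, -0.18], [0.09, -0.22], [0.11, -0.20]]
malicious = [[5.0, 5.0]]  # one compromised client sends a large update
updates = benign + malicious

avg = fedavg(updates)
med = coordinate_median(updates)
print("average:", avg)
print("median: ", med)
```

In this toy case the malicious update is an obvious outlier, so the median stays close to the benign updates while the average is pulled toward the attacker. The abstract reports that this kind of separation becomes much harder when client data is highly heterogeneous, since benign updates then differ widely from one another as well.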