ProtegoFed: Backdoor-Free Federated Instruction Tuning with Interspersed Poisoned Data

Haodong Zhao; Jinming Hu; Zhaomin Wu; Zongru Wu; Wei Du; Junyi Hou; Caibei Zhao; Zhuosheng Zhang; Bingsheng He; Gongshen Liu

TL;DR

This paper introduces ProtegoFed, the first backdoor-free Federated Instruction Tuning (FIT) framework that accurately detects, removes, and even purifies poisoned data interspersed across clients during training. It also proposes a global secondary clustering mechanism that lets clients collaboratively identify poisoned samples.

Abstract

Federated Instruction Tuning (FIT) enables collaborative instruction tuning of large language models across multiple organizations (clients) in a cross-silo setting without requiring the sharing of private instructions. Recent findings on natural backdoors and existing training data collection methods suggest that poisoned samples may be pervasive and inadvertently embedded in real-world datasets, potentially distributed across all clients, even if the clients are benign. This work systematically examines this threat in FIT, demonstrating that existing defenses are ineffective when poisoned data is interspersed among all clients. Addressing this challenge entails two major difficulties: identifying the distinctive characteristics of poisoned samples at each client, and enabling collaborative defense when some clients are heavily dominated by poisoned samples. To address these difficulties, we identify gradients in the frequency domain as a robust signal to distinguish poisoned data. We further propose a global secondary clustering mechanism that facilitates collaborative identification of poisoned samples across clients. In summary, this paper introduces ProtegoFed, the first backdoor-free FIT framework that accurately detects, removes, and even purifies interspersed poisoned data across clients during training. Experimental results on four FL datasets show that ProtegoFed identifies $92.00\% \sim 100.00\%$ of poisoned samples, reduces the attack success rate to almost zero, and maintains utility on the main task. Code is available at https://github.com/dongdongzhaoUP/ProtegoFed.
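To make the idea of a frequency-domain signal concrete, the following toy sketch transforms per-sample vectors with a DCT, keeps a few low-frequency coefficients, and splits them into two clusters. It is not the authors' implementation: the synthetic vectors stand in for per-sample gradients, the truncation to `keep` coefficients stands in for whatever dimensionality reduction the paper uses, and the helper names (`dct2`, `two_means`, `flag_minority`) and the rule "flag the smaller cluster" are all assumptions made for illustration.

```python
import math
import random

def dct2(x):
    """Unnormalised DCT-II of a 1-D sequence."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def two_means(points, iters=20):
    """Lloyd's algorithm with k=2 and a deterministic farthest-point init."""
    c0 = points[0]
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    labels = []
    for _ in range(iters):
        labels = [0 if sq_dist(p, c0) <= sq_dist(p, c1) else 1 for p in points]
        g0 = [p for p, lab in zip(points, labels) if lab == 0]
        g1 = [p for p, lab in zip(points, labels) if lab == 1]
        if not g0 or not g1:
            break
        c0, c1 = mean(g0), mean(g1)
    return labels

def flag_minority(vectors, keep=4):
    """Return indices in the smaller cluster of low-frequency DCT features.

    Hypothetical rule: the minority cluster is treated as suspicious.
    """
    feats = [dct2(v)[:keep] for v in vectors]
    labels = two_means(feats)
    minority = 0 if labels.count(0) < labels.count(1) else 1
    return {i for i, lab in enumerate(labels) if lab == minority}

# Synthetic stand-ins for per-sample gradients (an assumption, not real data):
# clean samples are pure noise, poisoned samples share a common offset.
random.seed(0)
DIM, N_CLEAN, N_POISON = 32, 40, 8
clean = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_CLEAN)]
poisoned = [[5.0 + random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_POISON)]
flagged = flag_minority(clean + poisoned)
print(sorted(flagged))
```

The minority-cluster rule only makes sense when poisoned samples are a small fraction of a client's data. The abstract notes that some clients may be heavily dominated by poisoned samples, which is the case a purely local rule like this one cannot handle and the stated motivation for the global secondary clustering step.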

Paper Structure (45 sections, 1 theorem, 15 equations, 13 figures, 12 tables)


Introduction
Preliminaries
Federated Instruction Tuning
Backdoor Attacks in LLMs
Learning Mechanisms of Backdoor in the Frequency Space
Threat Model
Adversary Model
Defense Objectives and Assumptions
Defense objectives
Defender’s capabilities and knowledge
Backdoor Vulnerability in FIT
System Design
Overview
Intra-Client Frequency-based Clustering
Global Secondary Clustering
...and 30 more sections

Key Result

Theorem 1

Let $\mathbf{M}^*$ be the optimal global model, let each client perform $E$ steps of local training, and let $\kappa=\frac{L}{\mu}$, $\gamma = \max\left\{8\kappa, E\right\}$, $B = \sum_{k=1}^{K} \left(\frac{n_k}{n}\right)^2 \sigma_k^2 + 6LT + 8(E-1)^2G^2$, $C = \frac{4}{K} E^2G^2$, and $\mathbf{E} = \mathbb{E}[f(\dots$
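As a worked illustration of the constants defined in the visible part of Theorem 1, the sketch below evaluates $\kappa$, $\gamma$, $B$, and $C$ for made-up inputs. The numeric values are hypothetical, the function name `theorem_constants` is not from the paper, $K$ is taken to be the number of terms in the sum, and $T$ is treated as a given scalar because the excerpt does not define it.

```python
def theorem_constants(L, mu, E, G, T, n_k, sigma_k):
    """Evaluate kappa, gamma, B, C exactly as written in the theorem excerpt.

    n_k and sigma_k are per-client lists; K is assumed to be their length.
    """
    n = sum(n_k)
    K = len(n_k)
    kappa = L / mu
    gamma = max(8 * kappa, E)
    B = (sum((nk / n) ** 2 * s ** 2 for nk, s in zip(n_k, sigma_k))
         + 6 * L * T
         + 8 * (E - 1) ** 2 * G ** 2)
    C = 4 / K * E ** 2 * G ** 2
    return kappa, gamma, B, C

# Hypothetical inputs chosen only to make the arithmetic easy to follow.
kappa, gamma, B, C = theorem_constants(
    L=4.0, mu=1.0, E=5, G=1.0, T=0.5,
    n_k=[100, 100, 200], sigma_k=[1.0, 1.0, 2.0],
)
print(kappa, gamma, B, C)
```

With these inputs, $8\kappa = 32$ exceeds $E = 5$, so $\gamma$ is set by the ratio $L/\mu$ rather than by the number of local steps, and the $8(E-1)^2G^2$ term contributes most of $B$.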

Figures (13)

Figure 1: An illustration of the backdoor risk caused by untrusted training data in FIT. Attackers acting as application users and data providers poison part of the data through various channels, and these data are collected for training on benign clients. The global model obtained by the benign server and clients through FIT training contains backdoors, posing security threats.
Figure 2: The general structure of a LoRA-adapted LLM and the composition and principle of the LoRA module.
Figure 3: The impact of the number of poisoned samples on the ASR of the final global model in FL. (a) The impact of the poison ratio in each client on ASR in the IID scenario (all clients have the same proportion of poisoned samples); (b) The impact of the proportion of clients holding poisoned samples on ASR, when the poison ratio of each such client is 10%.
Figure 4: (a) The visualization of directly applying GraCeFul locally. The experiment is conducted on WebQA using Vicuna-7B. Two representative failure modes are shown. The left sub-figure is the ground truth, and the right shows the predicted labels. On Client 0, poisoned samples go completely unrecognized, and on Client 1 there are many false positives on the clean data. (b) The visualization of globally integrated local and global centroids.
Figure 5: Overall workflow of $\mathsf{ProtegoFed}$ before training begins. Data obtained from third-party platforms is often unreliable. $\mathsf{ProtegoFed}$ eliminates the risk of poisoned samples in a single pass with very little overhead. In $\mathsf{ProtegoFed}$, each client applies DCT to obtain the frequency-domain characteristics of samples and performs dimensionality reduction for clustering. Then, the local centroid is calculated and sent to the server through the index retrieval of the main cluster.
...and 8 more figures

Theorems & Definitions (2)

Theorem 1
proof
