FedRW: Efficient Privacy-Preserving Data Reweighting for Enhancing Federated Learning of Language Models

Pukang Ye, Junwei Luo, Xiaolei Dong, Yunbo Yang

TL;DR

FedRW tackles data duplication in privacy-sensitive federated learning for language models by replacing hard data deletion with privacy-preserving, frequency-aware sample reweighting. It introduces PPMPR, a secure, third-party-free protocol that estimates global sample frequencies using two-party PSI and parallel orchestration to achieve efficient, scalable reweighting. The approach applies a logarithmic, frequency-based weighting to token-level losses, integrating seamlessly with FedAvg to improve generalization and robustness under duplication, including non-IID settings. Empirical results show substantial preprocessing speedups (up to 28.78x) and consistent perplexity improvements (~11.42%), demonstrating FedRW’s practical impact for privacy-preserving federated LLM training.
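
The logarithmic, frequency-based weighting of token-level losses can be illustrated with a small sketch. The exact weighting formula is not stated in this summary, so the form `1 / (1 + ln(freq))` and the names `frequency_weight` and `reweighted_loss` are assumptions chosen only to show the idea: a sample seen once keeps full weight, and heavily duplicated samples are damped rather than deleted.

```python
import math


def frequency_weight(freq: int) -> float:
    """Hypothetical log-damped weight: 1 / (1 + ln(freq)).

    Assumption: the paper's actual formula may differ; this form only
    shows weight 1 for unique samples and slow decay for duplicates.
    """
    if freq < 1:
        raise ValueError("freq must be >= 1")
    return 1.0 / (1.0 + math.log(freq))


def reweighted_loss(token_losses, freqs):
    """Weighted mean of per-sample losses.

    token_losses: list of lists, one list of per-token losses per sample.
    freqs: global frequency of each sample (same order as token_losses).
    """
    weights = [frequency_weight(f) for f in freqs]
    # Average token losses within each sample first.
    sample_losses = [sum(t) / len(t) for t in token_losses]
    total = sum(w * l for w, l in zip(weights, sample_losses))
    # Normalize by the weight mass so the loss scale stays comparable.
    return total / sum(weights)
```

In a FedAvg-style setup, each client would compute such a weighted loss locally; aggregation on the server is unchanged, which is consistent with the "integrates with FedAvg" description above.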

Abstract

Data duplication within large-scale corpora often impedes large language models' (LLMs) performance and privacy. In privacy-concerned federated learning scenarios, conventional deduplication methods typically rely on trusted third parties to perform uniform deletion, risking loss of informative samples while introducing privacy vulnerabilities. To address these gaps, we propose Federated ReWeighting (FedRW), the first privacy-preserving framework, to the best of our knowledge, that performs soft deduplication via sample reweighting instead of deletion in federated LLM training, without assuming a trusted third party. At its core, FedRW proposes a secure, frequency-aware reweighting protocol through secure multi-party computation, coupled with a parallel orchestration strategy to ensure efficiency and scalability. During training, FedRW utilizes an adaptive reweighting mechanism with global sample frequencies to adjust individual loss contributions, effectively improving generalization and robustness. Empirical results demonstrate that FedRW outperforms the state-of-the-art method by achieving up to 28.78x speedup in preprocessing and approximately 11.42% improvement in perplexity, while offering enhanced security guarantees. FedRW thus establishes a new paradigm for managing duplication in federated LLM training.
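
To make the idea of frequency estimation from pairwise interactions concrete, here is a plaintext sketch. It is not secure and is not the FedRW protocol: `plaintext_intersection` is a hypothetical stand-in for a two-party private set intersection call, and "frequency" is simplified here to the number of clients holding a sample, which may differ from the paper's definition.

```python
from itertools import combinations


def plaintext_intersection(a: set, b: set) -> set:
    # Hypothetical stand-in for a two-party PSI call. A real protocol
    # would compute this without exposing non-shared items.
    return a & b


def client_level_frequencies(client_data):
    """For each client, map each local sample to the number of clients
    that hold it (simplifying assumption, see lead-in).

    client_data: list of lists of hashable samples, one list per client.
    """
    sets = [set(d) for d in client_data]
    # Every sample is held by at least its owner.
    freqs = [{s: 1 for s in cs} for cs in sets]
    for i, j in combinations(range(len(sets)), 2):
        for s in plaintext_intersection(sets[i], sets[j]):
            # Each side learns one more holder of the shared sample.
            freqs[i][s] += 1
            freqs[j][s] += 1
    return freqs
```

Each client ends up with frequencies only for its own samples, and no central party is needed to collect the data, which mirrors the third-party-free setting described in the abstract.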

Paper Structure

This paper contains 37 sections, 2 theorems, 10 equations, 4 figures, 14 tables, 1 algorithm.

Key Result

Theorem 1

$\Pi_\text{2PC}$ securely implements the ideal functionality $f_\text{2PC}$ in the semi-honest model.

Figures (4)

  • Figure 1: Deduplication in Federated Learning (FL). (a) Challenges of global deduplication in decentralized settings: privacy constraints prohibit direct data sharing. (b) State-of-the-art solution utilizing hard deduplication over encrypted data, requiring a trusted third party.
  • Figure 2: FedRW Framework: Parallel $\Pi_\text{2PC}$-based Reweighting for Efficient FL. The overview is divided into three parts: (Left) The parallel orchestration of the third-party-free $\Pi_\text{2PC}$ protocol. (Center) The frequency-aware reweighting scheme that dynamically assigns weights (reflected by color) to samples while preserving data integrity. (Right) A comparison between FedRW and the baseline approach.
  • Figure 3: A toy example for the parallel orchestration when $n=8$.
  • Figure 4: Effect of client number and dataset size on protocol running time. With 10 to 50 clients, $2^{19}$ data items per client, and 30% duplication, PPMPR achieves a $17.61\textit{-}28.78\times$ speedup. With 50 clients, PPMPR outperforms the baseline by $4.09\textit{-}28.78\times$ as dataset size increases.
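
One way to picture a parallel orchestration of pairwise protocol runs among $n=8$ clients is a round-robin tournament schedule (the circle method), sketched below. This is an assumption for illustration only: the summary does not describe the paper's actual scheduling, and `round_robin_rounds` is a hypothetical helper. The point is that all pairs can be covered in $n-1$ rounds, with $n/2$ disjoint pairs running concurrently in each round.

```python
def round_robin_rounds(n: int):
    """Return n-1 rounds, each a list of n/2 disjoint client pairs,
    such that every unordered pair appears exactly once.

    Assumption: n is even; this is the standard circle method, not
    necessarily the schedule used by FedRW.
    """
    if n < 2 or n % 2:
        raise ValueError("this sketch assumes an even n >= 2")
    ids = list(range(n))
    rounds = []
    for _ in range(n - 1):
        # Pair the k-th client from the front with the k-th from the back.
        rounds.append([(ids[k], ids[n - 1 - k]) for k in range(n // 2)])
        # Keep the first client fixed and rotate the rest by one position.
        ids = [ids[0]] + [ids[-1]] + ids[1:-1]
    return rounds


if __name__ == "__main__":
    for r, pairs in enumerate(round_robin_rounds(8), start=1):
        print(f"round {r}: {pairs}")
```

With 8 clients this yields 7 rounds of 4 concurrent pairs, covering all 28 pairs, instead of 28 sequential pairwise runs.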

Theorems & Definitions (5)

  • Definition 1: Security
  • Theorem 1
  • proof
  • Theorem 2
  • proof