FedPFT: Federated Proxy Fine-Tuning of Foundation Models
Zhaopeng Peng, Xiaoliang Fan, Yufan Chen, Zheng Wang, Shirui Pan, Chenglu Wen, Ruisheng Zhang, Cheng Wang
TL;DR
This work tackles privacy-preserving adaptation of foundation models under Federated Learning, where using proxy sub-FMs often yields insufficient tuning and accumulating gradient errors. The authors introduce FedPFT, combining (i) layer-wise FFN compression to build sub-FMs with preserved layer correspondence, and (ii) a two-step knowledge distillation framework (layer-level before FL fine-tuning and neuron-level during FL) to tightly align sub-FMs with the full FM and guarantee convergence. Theoretical results establish an $O(1/k)$ convergence rate under specified Lipschitz conditions and gradient-discrepancy bounds, while empirical results on BERT-base, RoBERTa-base, and ViT-base across seven datasets demonstrate that FedPFT consistently outperforms gradient-mismatch baselines and approaches full-model fine-tuning performance without sharing server FMs or client data. The approach offers a practical, privacy-preserving path to effective cross-domain FM adaptation with reduced computational and communication costs, enabling scalable deployment in NLP and CV tasks.
Abstract
Adapting Foundation Models (FMs) to downstream tasks through Federated Learning (FL) has emerged as a promising strategy for protecting both data privacy and valuable FMs. Existing methods fine-tune an FM by allocating a sub-FM to clients in FL; however, this leads to suboptimal performance because of insufficient tuning and the inevitable accumulation of gradient errors. In this paper, we propose Federated Proxy Fine-Tuning (FedPFT), a novel method that enhances FM adaptation to downstream tasks through FL by means of two key modules. First, the sub-FM construction module employs a layer-wise compression approach that emphasizes crucial neurons, facilitating comprehensive fine-tuning of the FM across all layers. Second, the sub-FM alignment module conducts a two-step distillation, layer-level before FL fine-tuning and neuron-level during it, to reduce gradient error by accurately aligning the sub-FM with the FM under theoretical guarantees. Experimental results on seven commonly used datasets (four text and three vision) demonstrate the superiority of FedPFT.
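As a rough illustration of the two modules, the sketch below compresses a single feed-forward (FFN) block by keeping a subset of its hidden neurons and then measures how far the compressed block's output drifts from the full block's output. It is a minimal sketch under stated assumptions, not the authors' implementation: the neuron-importance score (product of incoming and outgoing weight norms), the mean-squared-error alignment loss, the omission of biases, and all function names (`ffn`, `compress_ffn`, `layer_alignment_loss`) are hypothetical choices made for illustration, since the abstract does not specify them.

```python
import math


def ffn(x, w_in, w_out):
    """Two-layer feed-forward block with ReLU; biases omitted for brevity.

    x: input vector (length d), w_in: d x h matrix, w_out: h x d_out matrix.
    """
    n_hidden = len(w_out)
    hidden = [
        max(0.0, sum(xi * w_in[i][j] for i, xi in enumerate(x)))
        for j in range(n_hidden)
    ]
    n_out = len(w_out[0])
    return [sum(h * w_out[j][k] for j, h in enumerate(hidden)) for k in range(n_out)]


def compress_ffn(w_in, w_out, keep):
    """Keep the `keep` highest-scoring hidden neurons of one FFN block.

    ASSUMPTION: the score is the product of a neuron's incoming and outgoing
    weight norms; the selection criterion used by FedPFT may differ.
    """
    n_hidden = len(w_out)

    def score(j):
        in_norm = math.sqrt(sum(row[j] ** 2 for row in w_in))
        out_norm = math.sqrt(sum(v ** 2 for v in w_out[j]))
        return in_norm * out_norm

    top = sorted(range(n_hidden), key=score, reverse=True)[:keep]
    kept = sorted(top)  # preserve the original neuron order
    sub_w_in = [[row[j] for j in kept] for row in w_in]
    sub_w_out = [list(w_out[j]) for j in kept]
    return kept, sub_w_in, sub_w_out


def layer_alignment_loss(x, full, sub):
    """Mean squared error between full-block and sub-block outputs.

    ASSUMPTION: MSE stands in for a layer-level distillation objective.
    """
    y_full = ffn(x, full[0], full[1])
    y_sub = ffn(x, sub[0], sub[1])
    return sum((a - b) ** 2 for a, b in zip(y_full, y_sub)) / len(y_full)


# Toy block: 2 inputs, 3 hidden neurons, 2 outputs. Neuron 1 has no incoming weight.
W_IN = [[1.0, 0.0, 2.0], [3.0, 0.0, 4.0]]
W_OUT = [[1.0, 1.0], [5.0, 5.0], [2.0, 2.0]]
kept, sub_in, sub_out = compress_ffn(W_IN, W_OUT, keep=2)
loss = layer_alignment_loss([1.0, 1.0], (W_IN, W_OUT), (sub_in, sub_out))
```

Because compression is applied per block, every layer of the full model keeps a counterpart in the sub-model, which is the layer correspondence the TL;DR refers to. In a real setting the loss above would be minimized over the sub-model's weights on some data, rather than merely evaluated once.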
