FedPFT: Federated Proxy Fine-Tuning of Foundation Models

Zhaopeng Peng; Xiaoliang Fan; Yufan Chen; Zheng Wang; Shirui Pan; Chenglu Wen; Ruisheng Zhang; Cheng Wang

TL;DR

This work tackles privacy-preserving adaptation of foundation models under Federated Learning, where using proxy sub-FMs often yields insufficient tuning and accumulating gradient errors. The authors introduce FedPFT, combining (i) layer-wise FFN compression to build sub-FMs with preserved layer correspondence, and (ii) a two-step knowledge distillation framework (layer-level before FL fine-tuning and neuron-level during FL) to tightly align sub-FMs with the full FM and guarantee convergence. Theoretical results establish an $O(1/k)$ convergence rate under specified Lipschitz conditions and gradient-discrepancy bounds, while empirical results on BERT-base, RoBERTa-base, and ViT-base across seven datasets demonstrate that FedPFT consistently outperforms gradient-mismatch baselines and approaches full-model fine-tuning performance without sharing server FMs or client data. The approach offers a practical, privacy-preserving path to effective cross-domain FM adaptation with reduced computational and communication costs, enabling scalable deployment in NLP and CV tasks.
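To make the idea of layer-wise FFN compression concrete, here is a minimal sketch in plain Python. It keeps every layer but narrows each feed-forward block to its highest-scoring hidden neurons. The scoring rule (product of incoming and outgoing weight norms), the helper names `neuron_scores`, `compress_ffn`, and `ffn`, and the toy weights are all assumptions for illustration; the summary above does not state which criterion FedPFT actually uses.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def neuron_scores(w_in: Matrix, w_out: Matrix) -> List[float]:
    """Hypothetical importance score: ||incoming weights|| * ||outgoing weights||."""
    scores = []
    for j in range(len(w_out)):
        in_norm = sum(row[j] ** 2 for row in w_in) ** 0.5
        out_norm = sum(v ** 2 for v in w_out[j]) ** 0.5
        scores.append(in_norm * out_norm)
    return scores


def compress_ffn(w_in: Matrix, b_in: List[float], w_out: Matrix,
                 keep: int) -> Tuple[Matrix, List[float], Matrix, List[int]]:
    """Keep the `keep` highest-scoring hidden neurons of one FFN block."""
    scores = neuron_scores(w_in, w_out)
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:keep]
    idx = sorted(top)  # preserve the original neuron order
    sub_w_in = [[row[j] for j in idx] for row in w_in]
    sub_b_in = [b_in[j] for j in idx]
    sub_w_out = [w_out[j] for j in idx]
    return sub_w_in, sub_b_in, sub_w_out, idx


def ffn(x: List[float], w_in: Matrix, b_in: List[float], w_out: Matrix) -> List[float]:
    """y = relu(x W_in + b_in) W_out for a single input vector x."""
    hidden = [
        max(0.0, sum(x[i] * w_in[i][j] for i in range(len(x))) + b_in[j])
        for j in range(len(b_in))
    ]
    return [
        sum(hidden[j] * w_out[j][k] for j in range(len(hidden)))
        for k in range(len(w_out[0]))
    ]


# Toy FFN block: model width 2, hidden width 4 (illustrative values only).
w_in = [[1.0, 0.1, -2.0, 0.0],
        [0.5, 0.0, 1.0, 0.1]]
b_in = [0.0, 0.0, 0.0, 0.0]
w_out = [[1.0, 0.0],
         [0.1, 0.1],
         [0.0, 3.0],
         [0.2, 0.0]]

sub_w_in, sub_b_in, sub_w_out, kept = compress_ffn(w_in, b_in, w_out, keep=2)
y_sub = ffn([1.0, 1.0], sub_w_in, sub_b_in, sub_w_out)
```

Applying `compress_ffn` to every layer yields a sub-model with the same depth as the original, so each sub-FM layer still corresponds to exactly one FM layer. A layer-drop scheme, by contrast, would remove whole blocks and lose that correspondence.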

Abstract

Adapting Foundation Models (FMs) for downstream tasks through Federated Learning (FL) has emerged as a promising strategy for protecting both data privacy and valuable FMs. Existing methods fine-tune an FM by allocating a sub-FM to clients in FL; however, this leads to suboptimal performance due to insufficient tuning and the inevitable accumulation of gradient errors. In this paper, we propose Federated Proxy Fine-Tuning (FedPFT), a novel method that enhances FM adaptation to downstream tasks through FL via two key modules. First, the sub-FM construction module employs a layer-wise compression approach, which facilitates comprehensive fine-tuning of the FM across all layers by emphasizing the crucial neurons. Second, the sub-FM alignment module conducts a two-step distillation (layer-level before FL fine-tuning and neuron-level during it) to reduce gradient error by accurately aligning the sub-FM with the FM under theoretical guarantees. Experimental results on seven commonly used datasets (four text and three vision) demonstrate the superiority of FedPFT.
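As a rough illustration of what a layer-level alignment objective could look like, the sketch below sums a per-layer mean squared error between the outputs of corresponding FM and sub-FM layers. The MSE objective, the choice to feed both layers the FM's activations, and the names `mse` and `layer_level_loss` are assumptions made here; the abstract does not specify the loss FedPFT uses.

```python
from typing import Callable, List, Sequence

Layer = Callable[[List[float]], List[float]]


def mse(u: Sequence[float], v: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def layer_level_loss(fm_layers: Sequence[Layer],
                     sub_layers: Sequence[Layer],
                     x: List[float]) -> float:
    """Sum of per-layer MSEs; every layer pair receives the FM's activations."""
    total = 0.0
    h = x
    for fm_layer, sub_layer in zip(fm_layers, sub_layers):
        fm_out = fm_layer(h)
        total += mse(fm_out, sub_layer(h))
        h = fm_out  # assumption: the next pair is driven by the FM's output
    return total


# Toy "layers" (illustrative only): the sub-model's first layer is misaligned.
fm = [lambda h: [2.0 * v for v in h], lambda h: [v + 1.0 for v in h]]
sub = [lambda h: [1.5 * v for v in h], lambda h: [v + 1.0 for v in h]]

loss_misaligned = layer_level_loss(fm, sub, [1.0, 2.0])
loss_aligned = layer_level_loss(fm, fm, [1.0, 2.0])
```

In this toy setup only the first layer contributes to the loss, which shows how a per-layer objective localizes the mismatch. Such a formulation requires the sub-model to have one layer per FM layer.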

Paper Structure (43 sections, 2 theorems, 33 equations, 4 figures, 11 tables)

This paper contains 43 sections, 2 theorems, 33 equations, 4 figures, 11 tables.

Introduction
Related Works
FM Fine-tuning through FL
FM Fine-tuning without using the full model
FedPFT
Preliminary
Federated Learning
Foundation Model Fine-tuning
Problem Definition
Method Overview
Sub-FM Construction Module based on Layer-wise Compression
Sub-FM Alignment Module based on Two-step Knowledge Distillation
Layer-level distillation before FL fine-tuning
Neuron-level distillation during FL fine-tuning
Cost Analysis
...and 28 more sections

Key Result

Theorem 1

Suppose both the function $f: \mathbb{R}^{n}\rightarrow \mathbb{R}$ and its approximation $f': \mathbb{R}^{n}\rightarrow \mathbb{R}$ are convex and differentiable, and their gradients are Lipschitz continuous with constants $L_{1} > 0$ and $L_{2} > 0$, respectively, i.e. for any $x, y \in \mathbb{R}^{n}$ we have $\|\nabla f(x) - \nabla f(y)\| \le L_{1}\|x - y\|$ and $\|\nabla f'(x) - \nabla f'(y)\| \le L_{2}\|x - y\|$. Then [...] it will yield a solution $f^{(k)}$ which satisfies [...], where $f(x^{*})$ is the optimal value.
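The setting of Theorem 1, descending along the gradient of an approximation $f'$ while measuring progress on $f$, can be explored numerically. The sketch below is not the paper's proof and does not reproduce its conditions. It uses a hypothetical convex quadratic $f(x) = \tfrac{1}{2}x^{\top}Ax - b^{\top}x$ with $A = \mathrm{diag}(1, 4)$, a proxy that replaces $A$ with $A + \epsilon I$, and an assumed step size $1/(L_1 + L_2)$.

```python
from typing import List


def grad_descent_with_proxy(eps: float, steps: int = 500) -> List[float]:
    """Run gradient descent using the proxy's gradient; return f(x_k) - f(x*).

    f(x)  = 0.5 * x^T A x - b^T x          with A = diag(1, 4)
    f'(x) = 0.5 * x^T (A + eps I) x - b^T x
    All constants are illustrative assumptions, not values from the paper.
    """
    a = [1.0, 4.0]
    b = [1.0, 2.0]
    a_proxy = [ai + eps for ai in a]

    # Gradient Lipschitz constants of f and f' (largest diagonal entries).
    l1, l2 = max(a), max(a_proxy)
    eta = 1.0 / (l1 + l2)  # hypothetical fixed step size

    def f(v: List[float]) -> float:
        return sum(0.5 * ai * vi * vi - bi * vi for ai, bi, vi in zip(a, b, v))

    f_star = f([bi / ai for ai, bi in zip(a, b)])  # minimizer is A^{-1} b

    x = [0.0, 0.0]
    gaps = [f(x) - f_star]
    for _ in range(steps):
        # Step along the gradient of the proxy f', not of f itself.
        x = [xi - eta * (ai * xi - bi) for xi, ai, bi in zip(x, a_proxy, b)]
        gaps.append(f(x) - f_star)
    return gaps


gap_exact = grad_descent_with_proxy(0.0)[-1]
gap_small = grad_descent_with_proxy(0.1)[-1]
gap_large = grad_descent_with_proxy(0.5)[-1]
```

With `eps = 0` the proxy equals $f$ and the gap vanishes. With a nonzero `eps` the iterates settle at the proxy's minimizer, leaving a residual gap on $f$ that grows with the gradient discrepancy. This is the effect that aligning the sub-FM with the FM is meant to limit.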

Figures (4)

Figure 1: A motivating example of two challenges in FM fine-tuning using a proxy sub-model. (a) Existing methods that construct sub-FMs via layer-drop compression discard intermediate layers of the FM, causing mismatched and insufficient fine-tuning, whereas FedPFT's layer-wise compression ensures comprehensive fine-tuning of the FM; and (b) as FL fine-tuning progresses, the discrepancy between the updates made by sub-FMs and FMs grows, leading to a deviation from the ideal update direction, which FedPFT aims to mitigate by accurately aligning sub-FMs and FMs.
Figure 2: The overall framework of FedPFT that enhances FMs adaptation in downstream tasks through FL by two key modules: (1) Sub-FM Construction Module constructs sub-FM by layer-wise compression to facilitate comprehensive FM fine-tuning; and (2) Sub-FM Alignment Module aligns sub-FM by two-step distillation to ensure accurate alignment between sub-FM and FM with a theoretical guarantee.
Figure 3: An example of two distillation processes
Figure 4: Visualisations of the label distributions of SST-2, QNLI, and CIFAR-10

Theorems & Definitions (6)

Theorem 1
proof
Theorem 2
proof
proof
proof
