FeedSign: Robust Full-parameter Federated Fine-tuning of Large Models with Extremely Low Communication Overhead of One Bit

Zhijie Cai, Haolong Chen, Guangxu Zhu

TL;DR

The paper tackles the high communication and memory costs of federated fine-tuning (FFT) for large models by introducing FeedSign, which encodes gradient information as seed-sign pairs and uses a shared PRNG to achieve a 1-bit upload and download payload per step with inference-scale memory. The authors prove an exponential convergence rate $\mathcal{O}(e^{-t})$ under standard assumptions and show robustness to data heterogeneity and Byzantine attacks, outperforming prior zeroth-order (ZO) FL baselines in many settings. Empirical results span models from 11M to 13B parameters and tasks in NLP and vision, demonstrating that FeedSign can closely match or exceed ZO baselines while reducing communication by orders of magnitude. The approach enables a lightweight, privacy-conscious FFT workflow with orbit-based model sharing and potential DP-compatible extensions, which makes it practical for deploying large models at scale in distributed environments.

Abstract

Federated fine-tuning (FFT) attempts to fine-tune a pre-trained model with private data from distributed clients by exchanging models rather than data under the orchestration of a parameter server (PS). To overcome the bottleneck created by the communication and memory overhead that clients face as model sizes grow, we propose FeedSign, an FFT algorithm in which the upload and download payload for an aggregation step is exactly $1$ bit, while the memory overhead is squeezed to the amount needed for inference. This is realized by utilizing zeroth-order (ZO) optimizers on large models and shared pseudo-random number generators (PRNG) across devices to represent the gradient estimates as seed-sign pairs. We conduct theoretical analysis on FeedSign and show that it converges at an exponential rate $\mathcal{O}(e^{-t})$, where $t$ is the number of elapsed steps, under widely used assumptions. Moreover, FeedSign is found to be robust against data heterogeneity and Byzantine attacks. We conduct extensive experiments on models across different structures and sizes (11M to 13B) and find that the proposed method performs better than or close to its ZO and first-order (FO) counterparts, depending on the scenario, while incurring an orders-of-magnitude lower communication overhead. We also discuss some interesting advantages that arise as byproducts of the minimalistic design of FeedSign.
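
The seed-sign idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy quadratic client losses, the majority-vote aggregation rule, and the choice to derive the seed from the step counter are all assumptions made for illustration. It only shows how a shared PRNG lets each client upload a single sign bit while every party regenerates the same perturbation direction locally.

```python
import numpy as np

DIM = 10  # toy model size (assumption; real models have millions of parameters)

def perturbation(seed, dim):
    # Any party holding the seed regenerates exactly the same direction z.
    return np.random.default_rng(seed).standard_normal(dim)

def client_sign(loss_fn, w, seed, mu=1e-3):
    # Two forward passes give a zeroth-order estimate of the directional
    # derivative along z; only its sign is reported (the 1-bit upload).
    z = perturbation(seed, w.size)
    projected_grad = (loss_fn(w + mu * z) - loss_fn(w - mu * z)) / (2 * mu)
    return 1 if projected_grad >= 0 else -1

def server_vote(signs):
    # Hypothetical aggregation rule: majority vote, ties broken toward +1.
    # The result is the 1-bit download.
    return 1 if sum(signs) >= 0 else -1

def apply_update(w, seed, sign, lr):
    # Every party applies the same update from (seed, sign) alone.
    return w - lr * sign * perturbation(seed, w.size)

def feedsign_round(loss_fns, w, step, lr=1e-2):
    seed = step  # assumption: the seed follows the step index, so it is never sent
    signs = [client_sign(f, w, seed) for f in loss_fns]
    return apply_update(w, seed, server_vote(signs), lr)

# Toy demo: five clients with slightly different quadratic objectives.
rng = np.random.default_rng(0)
targets = [np.ones(DIM) + 0.1 * rng.standard_normal(DIM) for _ in range(5)]
loss_fns = [lambda w, c=c: 0.5 * np.sum((w - c) ** 2) for c in targets]

def global_loss(w):
    return float(np.mean([f(w) for f in loss_fns]))

w = np.zeros(DIM)
initial_loss = global_loss(w)
for step in range(500):
    w = feedsign_round(loss_fns, w, step)
final_loss = global_loss(w)
print(initial_loss, final_loss)
```

In this sketch the client never stores gradients or activations for backpropagation, which is the sense in which memory stays at inference scale; the sign bit is the only value that crosses the network in either direction.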

Paper Structure

This paper contains 44 sections, 7 theorems, 39 equations, 10 figures, 13 tables, 1 algorithm.

Key Result

Lemma 3.9

Given that $\mathcal{L}(\boldsymbol{w})$ is an $L$-smooth function and $\hat{\nabla} \mathcal{L}(\boldsymbol{w}, \mathcal{B})$ is an unbiased gradient estimator with $\mu \to 0$, the expected per-step loss descent can be bounded by an expression that includes a factor characterizing the low-rank effect of the gradient estimator.

Figures (10)

  • Figure 1: Overview of FedAvg and FeedSign
  • Figure 2: Loss and accuracy curve versus the number of steps elapsed under data heterogeneity.
  • Figure 3: Loss and accuracy curve versus number of steps elapsed under Byzantine attacks, with a bigger client pool size.
  • Figure 4: Loss and accuracy curve versus number of steps elapsed under Byzantine attacks.
  • Figure 5: Orbits-based efficient model storage and sharing.
  • ...and 5 more figures
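
One possible reading of the orbit-based storage named in Figure 5 is sketched below. It assumes that an "orbit" is the ordered sequence of seed-sign pairs produced during fine-tuning, so a fine-tuned model can be rebuilt from the shared starting checkpoint by replaying that sequence. The function name `replay_orbit`, the fixed learning rate, and the toy orbit are hypothetical and are not taken from the paper.

```python
import numpy as np

def replay_orbit(w0, orbit, lr):
    # Rebuild the weights by re-applying every recorded (seed, sign) update
    # to a copy of the starting checkpoint w0.
    w = w0.copy()
    for seed, sign in orbit:
        z = np.random.default_rng(seed).standard_normal(w.size)
        w -= lr * sign * z
    return w

# Toy demo: a made-up orbit of 100 steps over an 8-parameter "model".
w0 = np.zeros(8)
orbit = [(t, 1 if t % 3 else -1) for t in range(100)]

w_a = replay_orbit(w0, orbit, lr=1e-2)
w_b = replay_orbit(w0, orbit, lr=1e-2)
print(np.array_equal(w_a, w_b))
```

Under these assumptions, two parties that hold the same starting checkpoint only need to exchange the orbit to arrive at identical weights, and if the seeds follow the step index the orbit costs roughly one bit per step.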

Theorems & Definitions (25)

  • Definition 3.1: Client Update
  • Definition 3.2: Update Aggregation
  • Remark 3.3
  • Lemma 3.9: Dimension-free Descent Lemma for ZO-FedSGD (Malladi et al., 2023)
  • Remark 3.10
  • Theorem 3.11: Global Convergence for FedSGD, ZO-FedSGD, and FeedSign
  • Remark 3.12
  • Remark 3.13
  • Remark 3.14
  • Remark 4.1
  • ...and 15 more