FeedSign: Robust Full-parameter Federated Fine-tuning of Large Models with Extremely Low Communication Overhead of One Bit
Zhijie Cai, Haolong Chen, Guangxu Zhu
TL;DR
The paper tackles the high communication and memory costs of federated fine-tuning (FFT) for large models by introducing FeedSign, which encodes gradient information as seed-sign pairs and uses a shared PRNG so that each step requires only a 1-bit upload and a 1-bit download, with inference-scale memory on the client. The authors prove an exponential convergence rate $\mathcal{O}(e^{-t})$ under standard assumptions and show robustness to data heterogeneity and Byzantine attacks. Empirical results span models from 11M to 13B parameters and tasks in NLP and vision, demonstrating that FeedSign closely matches or exceeds prior zeroth-order (ZO) baselines in many settings while reducing communication by orders of magnitude. The approach enables a lightweight, privacy-conscious FFT workflow with orbit-based model sharing and potential DP-compatible extensions, which makes it practical for deploying large models at scale in distributed environments.
Abstract
Federated fine-tuning (FFT) fine-tunes a pre-trained model on private data held by distributed clients, exchanging models rather than data under the orchestration of a parameter server (PS). As model sizes grow, the communication and memory overhead on clients becomes the bottleneck of such systems. To overcome it, we propose \textit{FeedSign}, an FFT algorithm in which the upload and download payload of each aggregation step is exactly $1$ bit, while the memory overhead is reduced to the amount needed for inference. This is realized by using zeroth-order (ZO) optimizers on large models and pseudo-random number generators (PRNGs) shared across devices, so that gradient estimates can be represented as seed-sign pairs. We analyze FeedSign theoretically and show that, under widely used assumptions, it converges at an exponential rate $\mathcal{O}(e^{-t})$, where $t$ is the number of elapsed steps. Moreover, FeedSign is robust against data heterogeneity and Byzantine attacks. Extensive experiments on models of different architectures and sizes (11M to 13B parameters) show that the proposed method performs better than or comparably to its ZO and first-order (FO) counterparts, depending on the scenario, while incurring orders-of-magnitude lower communication overhead. We also discuss several advantages that arise as byproducts of the minimalistic design of \textit{FeedSign}.
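To make the seed-sign idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy linear-regression loss in place of a large model, a two-point ZO estimate of the projected gradient, the step index used as the shared seed, and majority voting at the PS. All function names (`perturbation`, `client_sign`, `server_vote`, `apply_update`, `run`) and hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def loss(w, X, y):
    """Mean squared error of a linear model (toy stand-in for a fine-tuning loss)."""
    r = X @ w - y
    return float(np.mean(r ** 2))


def perturbation(seed, dim):
    """Regenerate the random direction from a seed; the vector itself is never sent."""
    return np.random.default_rng(seed).standard_normal(dim)


def client_sign(w, X, y, seed, eps=1e-3):
    """Client upload: one bit, the sign of the two-point ZO projected gradient."""
    z = perturbation(seed, w.size)
    proj = (loss(w + eps * z, X, y) - loss(w - eps * z, X, y)) / (2 * eps)
    return 1 if proj >= 0 else -1


def server_vote(signs):
    """PS download: one bit, the majority vote (assumption: ties resolve to +1)."""
    return 1 if sum(signs) >= 0 else -1


def apply_update(w, seed, sign, lr):
    """Every client rebuilds z from the seed and steps against the voted sign."""
    return w - lr * sign * perturbation(seed, w.size)


def run(num_clients=5, dim=10, steps=2000, lr=1e-2, data_seed=0):
    """Simulate the loop and return (initial_loss, final_loss) on the pooled data."""
    rng = np.random.default_rng(data_seed)
    w_true = rng.standard_normal(dim)
    clients = []
    for _ in range(num_clients):
        X = rng.standard_normal((32, dim))
        clients.append((X, X @ w_true))
    X_all = np.vstack([X for X, _ in clients])
    y_all = np.concatenate([y for _, y in clients])

    w = np.zeros(dim)
    initial = loss(w, X_all, y_all)
    for t in range(steps):
        seed = t  # assumption: the step index serves as the shared seed
        signs = [client_sign(w, X, y, seed) for X, y in clients]
        w = apply_update(w, seed, server_vote(signs), lr)
    return initial, loss(w, X_all, y_all)


if __name__ == "__main__":
    initial, final = run()
    print(f"initial loss: {initial:.4f}, final loss: {final:.4f}")
```

In this sketch the only values crossing the network per step are the client bits and the voted bit. The perturbation is reconstructed locally from the seed, and no gradient or activation buffers are kept beyond what the forward passes need.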
