Sequential Compression Layers for Efficient Federated Learning in Foundational Models
Navyansh Mahla, Sunny Gupta, Amit Sethi
TL;DR
The authors address the inefficiencies of LoRA in federated fine-tuning of foundation models by introducing a sequential compression layer: a compact MLP inserted between the first and second MLP layers of the transformer FFN. By freezing the first MLP and the attention components and training only the compression layer and the subsequent projection, the method obtains a parameter-efficient update mechanism with a linear excess risk bound of $\nabla$-flow through a controlled subspace, in contrast to LoRA’s quadratic bounds. Empirically, the approach delivers substantial performance gains on both text (MedQuAD and Dolly-15k with Gemma-2B/TinyLlama) and vision (Brain Tumor with SigLIP) tasks under highly non-IID Dirichlet partitions, outperforming FedSA-LoRA and FFA-LoRA while maintaining comparable parameter efficiency. The combination of theoretical guarantees and consistent cross-modal improvements suggests strong practical potential for privacy-preserving distributed fine-tuning of foundation models in resource-constrained FL deployments.
Abstract
Federated Learning (FL) has gained popularity for fine-tuning large language models (LLMs) across multiple nodes, each with its own private data. While LoRA has been widely adopted for parameter-efficient federated fine-tuning, recent theoretical and empirical studies highlight its suboptimal performance in the federated learning context. In response, we propose a novel, simple, and more effective parameter-efficient fine-tuning method that does not rely on LoRA. Our approach introduces a small multi-layer perceptron (MLP) layer between two existing MLP layers, up_proj (the FFN projection layer following the self-attention module) and down_proj, within the feed-forward network of the transformer block. This solution addresses the bottlenecks associated with LoRA in federated fine-tuning and outperforms recent LoRA-based approaches, demonstrating superior performance for both language models and vision encoders.
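
To make the layer placement concrete, the following is a minimal NumPy sketch of a feed-forward block with a small layer inserted after the up projection. It is an illustration under stated assumptions, not the authors' implementation: the sizes `d_model`, `d_ff`, and `r`, the ReLU activation, the weight names, and the shapes of the compression layer (`d_ff -> r`) and of the following projection (`r -> d_model`) are all hypothetical, and the gating branch used in some FFN variants is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
d_model, d_ff, r = 16, 64, 8

def relu(z):
    return np.maximum(z, 0.0)

# Frozen in this sketch: the pretrained up projection (d_model -> d_ff).
W_up = rng.standard_normal((d_model, d_ff)) * 0.02
# Trainable in this sketch: the inserted compression layer (assumed shape d_ff -> r).
W_comp = rng.standard_normal((d_ff, r)) * 0.02
# Trainable in this sketch: the subsequent projection (assumed shape r -> d_model).
W_down = rng.standard_normal((r, d_model)) * 0.02

def ffn_forward(x):
    h = relu(x @ W_up)    # frozen first MLP layer
    c = relu(h @ W_comp)  # small inserted layer
    return c @ W_down     # projection back to the model width

x = rng.standard_normal((4, d_model))  # a batch of 4 token vectors
y = ffn_forward(x)

# Parameter bookkeeping for this toy block only (attention weights not modeled).
trainable = W_comp.size + W_down.size
frozen = W_up.size
print(y.shape, trainable, frozen)
```

In this toy setup only `W_comp` and `W_down` would receive gradient updates on a client; `W_up` stays at its pretrained values, mirroring the freezing scheme described above.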
