FedPromo: Federated Lightweight Proxy Models at the Edge Bring New Domains to Foundation Models

Matteo Caligiuri, Francesco Barbato, Donald Shenaj, Umberto Michieli, Pietro Zanuttigh

TL;DR

FedPromo tackles the challenge of adapting large foundation models to new domains seen only on edge devices, under privacy constraints, by introducing Cross-Architecture Federated Knowledge Transfer (CA-FKT). It first performs server-side cross-architectural distillation to align a lightweight client encoder with a powerful server encoder, then restricts federated learning to training only a local classifier on clients, while a translator maps client features to the server space. The framework employs inactive-classes preservation and class de-biasing to maintain stability under non-IID, multi-domain data and to minimize cross-class interference during aggregation. Across five domain pairs, FedPromo consistently outperforms strong baselines, approaching centralized performance especially when in-domain pretraining is available, while keeping user data private and client-side computation low.

Abstract

Federated Learning (FL) is an established paradigm for training deep learning models on decentralized data. However, as the size of the models grows, conventional FL approaches often require significant computational resources on client devices, which may not be feasible. We introduce FedPromo, a novel framework that enables efficient adaptation of large-scale foundation models stored on a central server to new domains encountered only by remote clients. Instead of directly training the large model on client devices, FedPromo optimizes lightweight proxy models via FL, significantly reducing computational overhead while maintaining privacy. Our method follows a two-stage process: first, server-side knowledge distillation aligns the representations of a large-scale foundation model (e.g., a transformer) with those of a compact counterpart (e.g., a CNN). Then, the compact model encoder is deployed to client devices, where trainable classifiers are learned locally. These classifiers are subsequently aggregated and seamlessly transferred back to the foundation model, facilitating personalized adaptation without requiring direct access to user data. Through novel regularization strategies, our framework enables decentralized multi-domain learning, balancing performance, privacy, and resource efficiency. Extensive experiments on five image classification benchmarks demonstrate that FedPromo outperforms existing methods while assuming limited-resource clients.
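
The aggregation step described above (clients train only a classifier, and the server merges the results) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `aggregate_classifiers`, the per-class sample counts, and the rule "a class that no client observed keeps its previous server weights" are assumptions, chosen here as one simple reading of what preserving inactive classes could mean.

```python
from typing import List, Sequence

Weights = List[List[float]]  # one row of classifier weights per class


def aggregate_classifiers(
    global_w: Weights,
    client_ws: Sequence[Weights],
    client_counts: Sequence[Sequence[int]],
) -> Weights:
    """Per-class weighted average of client classifier rows (hypothetical rule).

    client_counts[k][c] is the number of samples of class c seen by client k.
    A class that no client saw in this round keeps its current global row.
    """
    merged: Weights = []
    for c, global_row in enumerate(global_w):
        total = sum(counts[c] for counts in client_counts)
        if total == 0:
            # Inactive class: leave the server's weights untouched.
            merged.append(list(global_row))
            continue
        row = [0.0] * len(global_row)
        for weights, counts in zip(client_ws, client_counts):
            share = counts[c] / total  # clients with no samples contribute 0
            for j, value in enumerate(weights[c]):
                row[j] += share * value
        merged.append(row)
    return merged


# Toy round: 3 classes, 2-dimensional features, two clients.
global_w = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
client_a = [[2.0, 2.0], [9.0, 9.0], [9.0, 9.0]]  # only saw class 0
client_b = [[4.0, 4.0], [3.0, 3.0], [9.0, 9.0]]  # saw classes 0 and 1
merged = aggregate_classifiers(
    global_w, [client_a, client_b], [[10, 0, 0], [30, 10, 0]]
)
```

In the toy round, class 2 is seen by no client, so its row stays at the server's previous value, while rows for classes 0 and 1 are averaged only over the clients that actually held samples of them.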

Paper Structure

This paper contains 24 sections, 5 equations, 15 figures, 11 tables, 4 algorithms.

Figures (15)

  • Figure 1: FedPromo enables large-scale model adaptation via Federated Learning through compact proxy models. By combining cross-architectural distillation with specialized training strategies, it achieves private, decentralized, multi-domain adaptation to data on resource-constrained clients.
  • Figure 2: FedPromo employs a two-stage architecture. The top row illustrates the cross-architectural distillation pretraining, where we transfer features useful for classification by training the decoder $D$ only on Oracle features and keeping it frozen when the student feature vector $\mathbf{F}$ is the input. We also add an extra knowledge distillation loss acting directly on the features. The bottom row summarizes the federated setup, depicting decentralized training on two sample clients and subsequent model aggregation on the server.
  • Figure 3: t-SNE plot showing the alignment of the features after pretraining. For clarity, it shows the embeddings of just 10 classes. The full plot with all classes is in the Supp. Mat.
  • Figure 4: Self-similarity plots of the CompCars classifier weights trained with CDB off (left) and with CDB on (right). Our CDB block drastically reduces cross-talk between classes.
  • Figure S.1: Clipping norm scheduler for the CompCars dataset
  • ...and 10 more figures
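
The pretraining stage described in the Figure 2 caption can be written as a rough objective. This is only one plausible reading of the caption, not the paper's stated loss. Only $D$ and $\mathbf{F}$ come from the caption; the student encoder $E_S$, the Oracle features $\mathbf{F}^{O}$, the losses $\ell_{\mathrm{CE}}$ and $\ell_{\mathrm{KD}}$, and the weight $\lambda$ are symbols assumed here for illustration. The caption also does not say whether the two steps run jointly or in sequence.

```latex
\begin{align}
  % (a) The decoder D is trained only on Oracle features F^O.
  \theta_D^{*} &= \arg\min_{\theta_D}\;
      \ell_{\mathrm{CE}}\!\big(D_{\theta_D}(\mathbf{F}^{O}),\, y\big) \\
  % (b) With D frozen, the student encoder is trained through D,
  %     plus a distillation term acting directly on the features.
  \theta_S^{*} &= \arg\min_{\theta_S}\;
      \ell_{\mathrm{CE}}\!\big(D_{\theta_D^{*}}(\mathbf{F}),\, y\big)
      + \lambda\, \ell_{\mathrm{KD}}\!\big(\mathbf{F},\, \mathbf{F}^{O}\big),
  \qquad \mathbf{F} = E_S(x;\, \theta_S)
\end{align}
```

Under this reading, the frozen decoder acts as a fixed judge of whether student features are usable for classification, while the feature-level term pulls $\mathbf{F}$ toward $\mathbf{F}^{O}$ directly.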