TAP: Two-Stage Adaptive Personalization of Multi-task and Multi-Modal Foundation Models in Federated Learning
Seohyun Lee, Wenzhi Fang, Dong-Jun Han, Seyyedali Hosseinalipour, Christopher G. Brinton
TL;DR
TAP addresses personalization in federated learning (FL) when clients differ not only in data but also in modalities and tasks. It uses a two-stage approach. During FL, each client adaptively replaces its local components with server-provided ones, but only for the modality-task pairs where the replacement helps. After FL, knowledge distillation (KD) injects generalized server knowledge into the personalized models without compromising personalization. The authors provide a convergence analysis showing that the server model’s ability to serve all tasks degrades as the number of modality-task pairs grows, which motivates client-side personalization. Empirically, TAP achieves better personalization performance than strong baselines across image and text tasks using FLAVA and ViLT, and the gains persist in ablations over the KD and margin settings. This work offers a scalable, practical framework for deploying large, multi-modal foundation models in federated environments with heterogeneous client requirements.
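The first stage can be illustrated with a minimal sketch. The summary does not spell out the replacement rule, so the sketch assumes one plausible criterion: a client accepts the server-provided component for a modality-task pair only when the server's loss on the client's local data beats the local loss by at least a margin. The function name `select_replacements`, the loss dictionaries, and the `margin` parameter are hypothetical and are not taken from the TAP code.

```python
from typing import Dict, List, Tuple

# A modality-task pair, e.g. ("image", "classification"). Hypothetical naming.
PairKey = Tuple[str, str]


def select_replacements(
    local_loss: Dict[PairKey, float],
    server_loss: Dict[PairKey, float],
    margin: float = 0.0,
) -> List[PairKey]:
    """Pick the pairs whose server-provided component should replace the local one.

    Assumed rule (not confirmed by the source): replace only when the server
    component's loss on local data is lower than the local loss by more than
    `margin`.
    """
    chosen = []
    for pair, loss_local in local_loss.items():
        loss_server = server_loss.get(pair)
        if loss_server is None:
            # The server sent nothing for this pair; keep the local component.
            continue
        if loss_server + margin < loss_local:
            chosen.append(pair)
    return chosen


if __name__ == "__main__":
    local = {("image", "classification"): 0.9, ("text", "classification"): 0.5}
    server = {("image", "classification"): 0.6, ("text", "classification"): 0.55}
    print(select_replacements(local, server, margin=0.1))
```

Under this assumed rule, a larger margin makes a client more conservative: it keeps its personalized components unless the server version is clearly better on its own data.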
Abstract
Federated Learning (FL), despite demonstrating impressive capabilities in the training of multiple models in a decentralized manner, has been shown to produce a final model not necessarily well-suited to the needs of each client. While extensive work has been conducted on how to create tailored personalized models, called Personalized Federated Learning (PFL), less attention has been given to personalization via fine-tuning of foundation models with multi-task and multi-modal properties. Moreover, there exists a lack of understanding in the literature on how to fine-tune and personalize such models in a setting that is heterogeneous across clients not only in data, but also in tasks and modalities. To address this gap in the literature, we propose TAP (Two-Stage Adaptive Personalization), which (i) leverages mismatched model architectures between the clients and server to selectively conduct replacement operations when it benefits a client's local tasks and (ii) engages in post-FL knowledge distillation for capturing beneficial general knowledge without compromising personalization. We also introduce the first convergence analysis of the server model under its modality-task pair architecture, and demonstrate that as the number of modality-task pairs increases, its ability to cater to all tasks suffers. Through extensive experiments, we demonstrate the effectiveness of our proposed algorithm across a variety of datasets and tasks in comparison to a multitude of baselines. Implementation code is publicly available at https://github.com/lee3296/TAP.
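To make the post-FL distillation stage concrete, the following is a minimal sketch of a standard knowledge distillation objective, with the server model as teacher and the personalized client model as student. It uses the common temperature-scaled formulation (cross-entropy on the true label plus a KL term toward the teacher's softened outputs); the exact loss used by TAP may differ, and the names `kd_loss`, `temperature`, and `alpha` are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    label: int,
    temperature: float = 2.0,
    alpha: float = 0.5,
) -> float:
    """Generic distillation loss for one example (assumed form, not TAP's exact loss).

    alpha weights the supervised cross-entropy term; (1 - alpha) weights the
    KL divergence from the teacher's softened distribution to the student's.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # Cross-entropy of the student against the true label at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the KL gradient scale comparable across temperatures.
    return alpha * ce + (1.0 - alpha) * (temperature ** 2) * kl


if __name__ == "__main__":
    student = [1.0, 2.0, 0.5]
    teacher = [0.5, 3.0, 0.2]
    print(kd_loss(student, teacher, label=1))
```

In this formulation, `alpha` controls the trade-off the abstract describes: a higher value keeps the student closer to its local labels (personalization), while a lower value pulls it toward the server's general knowledge.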
