Towards Multi-modal Transformers in Federated Learning
Guangyu Sun, Matias Mendieta, Aritra Dutta, Xin Li, Chen Chen
TL;DR
This work addresses the problem of training high-performance multi-modal transformers under federated learning when clients hold unpaired uni-modal data, a setting shaped by data silos and privacy constraints. It introduces transfer MFL in the vision-language domain and the FedCola framework, which combines modality-complementary local training with collaborative aggregation over a transformer backbone unified across modalities. In each round, per-modality FedAvg yields the averaged updates $\nabla \bar w^{(t+1,M)}$, these are combined as $\nabla w^{(t+1)}=\boldsymbol{\Omega}\nabla \bar w^{(t+1)}$, and the global model is updated as $w^{(t+1)}=w^{(t)}+\nabla w^{(t+1)}$. FedCola's core ideas are (i) decomposing models into embedding layers, transformer blocks, and heads, with shared transformer blocks and modality-specific embeddings and heads, (ii) implementing complementary local training on uni-modal clients through a gate-controlled fusion of cross-modal weights, and (iii) performing an aggregation-with-disaggregation strategy on the server to align cross-modality and in-modality knowledge using matrices such as $\boldsymbol{\Omega}_{\text{server}}$ and $\boldsymbol{\Omega}_{\text{comp}}$. The approach is evaluated on real-world vision-language datasets (e.g., Flickr30k, COCO) together with standard uni-modal datasets (CIFAR-100, AG News), including settings with a medical-domain gap. FedCola consistently outperforms baselines such as FedAvg, FedProx, CreamFL, and FedIoT across diverse FL settings and domain gaps, and ablation studies, fairness analyses using Shapley values, and scaling experiments further examine the contribution of each component and the robustness of the framework. Overall, FedCola provides a practical, public-data-free pathway to train large multi-modal transformers in FL, offering new insights into cross-modality collaboration, convergence conditions, and the relative importance of different client types for multi-modal knowledge sharing.
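The round structure summarized above (per-modality FedAvg, mixing through $\boldsymbol{\Omega}$, then an additive global update) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the flat-list representation of the shared transformer-block parameters, and the example values of $\boldsymbol{\Omega}$ are all assumptions made for illustration, and the actual structure of $\boldsymbol{\Omega}_{\text{server}}$ and $\boldsymbol{\Omega}_{\text{comp}}$ is not reproduced here.

```python
from typing import Dict, List, Tuple

Vector = List[float]  # hypothetical flat view of the shared transformer-block parameters


def fedavg_update(client_updates: List[Tuple[Vector, int]]) -> Vector:
    """Sample-weighted average of client updates for one modality (FedAvg weighting)."""
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [sum(u[i] * n for u, n in client_updates) / total for i in range(dim)]


def mix_modalities(
    avg_updates: Dict[str, Vector],
    omega: Dict[str, Dict[str, float]],
) -> Dict[str, Vector]:
    """Apply a mixing matrix Omega: each target modality receives a weighted
    sum of the per-modality averaged updates (row of Omega = target modality)."""
    dim = len(next(iter(avg_updates.values())))
    return {
        tgt: [sum(row[src] * avg_updates[src][i] for src in row) for i in range(dim)]
        for tgt, row in omega.items()
    }


def server_round(
    weights: Dict[str, Vector],
    updates_by_modality: Dict[str, List[Tuple[Vector, int]]],
    omega: Dict[str, Dict[str, float]],
) -> Dict[str, Vector]:
    """One round: per-modality FedAvg, Omega mixing, then w <- w + update."""
    avg = {m: fedavg_update(ups) for m, ups in updates_by_modality.items()}
    mixed = mix_modalities(avg, omega)
    return {m: [w + d for w, d in zip(weights[m], mixed[m])] for m in weights}
```

With an identity $\boldsymbol{\Omega}$ (each modality weighted 1.0 on itself and 0.0 on the other), `server_round` reduces to independent per-modality FedAvg. Off-diagonal entries are what let image clients and text clients contribute to each other's transformer blocks in this sketch.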
Abstract
Multi-modal transformers mark significant progress across domains, but siloed high-quality data hinders their further improvement. To remedy this, federated learning (FL) has emerged as a promising privacy-preserving paradigm for training models without direct access to the raw data held by different clients. Despite its potential, an important research direction concerning unpaired uni-modal clients and the transformer architecture in FL remains unexplored. To fill this gap, this paper explores a transfer multi-modal federated learning (MFL) scenario within the vision-language domain, where clients possess data of various modalities distributed across different datasets. We systematically evaluate the performance of existing methods when a transformer architecture is used and introduce a novel framework, Federated modality complementary and collaboration (FedCola), which addresses the in-modality and cross-modality gaps among clients. In extensive experiments across various FL settings, FedCola outperforms previous approaches, offering new perspectives on future federated training of multi-modal transformers.
