Towards Multi-modal Transformers in Federated Learning

Guangyu Sun; Matias Mendieta; Aritra Dutta; Xin Li; Chen Chen

TL;DR

This work addresses training high-performance multi-modal transformers with federated learning (FL) when clients hold unpaired uni-modal data, thereby addressing data silos and privacy concerns. It introduces transfer multi-modal federated learning (MFL) in the vision-language domain and the FedCola framework, which combines modality-complementary local training with collaborative aggregation over a unified transformer backbone shared across modalities. In each round, per-modality FedAvg yields the updates $\nabla \bar w^{(t+1,M)}$, the server combines them as $\nabla w^{(t+1)}=\boldsymbol{\Omega}\nabla \bar w^{(t+1)}$, and the global model is updated by $w^{(t+1)}=w^{(t)}+\nabla w^{(t+1)}$. FedCola's core ideas are (i) decomposing models into embedding layers, transformer blocks, and heads, with shared transformer blocks and modality-specific embeddings and heads, (ii) implementing complementary local training on uni-modal clients through a gate-controlled fusion of cross-modal weights, and (iii) applying an aggregation-with-disaggregation strategy on the server to align cross-modality and in-modality knowledge using matrices such as $\boldsymbol{\Omega}_{\text{server}}$ and $\boldsymbol{\Omega}_{\text{comp}}$. The approach is evaluated on real-world vision-language datasets (e.g., Flickr30k, COCO) together with standard uni-modal datasets (CIFAR-100, AG News), and under medical-domain gaps. FedCola consistently outperforms baselines such as FedAvg, FedProx, CreamFL, and FedIoT across diverse FL settings and domain gaps; ablation studies, a fairness analysis using Shapley values, and scaling experiments further support the contributions and robustness of the framework. Overall, FedCola provides a practical pathway to training large multi-modal transformers in FL without public data, offering new insights into cross-modality collaboration, convergence conditions, and the relative importance of different client types for multi-modal knowledge sharing.
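The per-round update rule quoted in the TL;DR can be illustrated with a small sketch. This is one plausible reading, not the authors' implementation: it assumes each modality's shared transformer-block update is a flat list of floats, that per-modality FedAvg weights clients by sample count, and that $\boldsymbol{\Omega}$ acts as an $M \times M$ mixing matrix whose row $i$ says how much modality $i$ takes from each modality's averaged update. The names `fedavg`, `mix_modalities`, and `server_round` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def fedavg(updates: Sequence[Vector], num_samples: Sequence[int]) -> Vector:
    """Sample-weighted average of the client updates of ONE modality."""
    total = float(sum(num_samples))
    dim = len(updates[0])
    avg = [0.0] * dim
    for upd, n in zip(updates, num_samples):
        for k in range(dim):
            avg[k] += (n / total) * upd[k]
    return avg


def mix_modalities(avg_updates: List[Vector],
                   omega: List[List[float]]) -> List[Vector]:
    """Assumed form of grad_w = Omega @ grad_w_bar, applied per coordinate."""
    dim = len(avg_updates[0])
    return [
        [sum(row[j] * avg_updates[j][k] for j in range(len(row)))
         for k in range(dim)]
        for row in omega
    ]


def server_round(weights: List[Vector],
                 client_updates: List[List[Vector]],
                 client_sizes: List[List[int]],
                 omega: List[List[float]]) -> List[Vector]:
    """One round: per-modality FedAvg, mixing by Omega, then w + grad_w."""
    avg = [fedavg(u, s) for u, s in zip(client_updates, client_sizes)]
    mixed = mix_modalities(avg, omega)
    return [[w + g for w, g in zip(w_m, g_m)]
            for w_m, g_m in zip(weights, mixed)]


# Toy setup (made-up numbers): modality 0 = image, modality 1 = text.
weights = [[0.0, 0.0], [1.0, 1.0]]
client_updates = [
    [[1.0, 2.0], [3.0, 4.0]],  # two image clients
    [[-1.0, 0.0]],             # one text client
]
client_sizes = [[1, 3], [5]]

identity = [[1.0, 0.0], [0.0, 1.0]]    # no cross-modality sharing
uniform = [[0.5, 0.5], [0.5, 0.5]]     # equal sharing between modalities

print(server_round(weights, client_updates, client_sizes, identity))
print(server_round(weights, client_updates, client_sizes, uniform))
```

With the identity matrix the sketch reduces to independent per-modality FedAvg; off-diagonal entries are where cross-modality collaboration would enter under this reading. How FedCola actually constructs $\boldsymbol{\Omega}_{\text{server}}$ and $\boldsymbol{\Omega}_{\text{comp}}$ is not stated in this summary, so the matrices above are placeholders.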

Abstract

Multi-modal transformers mark significant progress in different domains, but siloed high-quality data hinders their further improvement. To remedy this, federated learning (FL) has emerged as a promising privacy-preserving paradigm for training models without direct access to the raw data held by different clients. Despite its potential, a considerable research direction regarding the unpaired uni-modal clients and the transformer architecture in FL remains unexplored. To fill this gap, this paper explores a transfer multi-modal federated learning (MFL) scenario within the vision-language domain, where clients possess data of various modalities distributed across different datasets. We systematically evaluate the performance of existing methods when a transformer architecture is utilized and introduce a novel framework called Federated modality complementary and collaboration (FedCola) by addressing the in-modality and cross-modality gaps among clients. Through extensive experiments across various FL settings, FedCola demonstrates superior performance over previous approaches, offering new perspectives on future federated training of multi-modal transformers.

Paper Structure (35 sections, 1 theorem, 13 equations, 12 figures, 10 tables)

Introduction
Related Work
Transfer MFL
Method
The framework
Complementary local training against cross-modality gap
Collaborative aggregation against in-modality gap
Experiments
Evaluation under different FL settings
Evaluation under different domain gaps
Discussion
Ablation study
Fairness analysis
Scaling-up capability
Conclusion
...and 20 more sections

Key Result

Theorem

For each $j\in[M]$, let $F_j$ satisfy Assumptions ass:minimum through ass:positive_eigenvalue. Then

Figures (12)

Figure 1: The transfer multi-modal federated learning setting in the vision-language domain. The clients possess data of various modalities, distributed across different datasets, and have different local training objectives. The server aims to collaboratively train a multi-modal transformer with the data from all clients.
Figure 2: Overview of our proposed framework, FedCola. In each round of FL, uni-modal clients download the global model for their modality along with the transformer blocks from the other modalities and perform complementary local training to address the cross-modality gap, while multi-modal clients perform standard multi-modal local training. Then, all clients send their local updates to the server for uni-aggregation and collaborative aggregation to address the in-modality gap.
Figure 3: Illustration of the updates to self-attention (Attention) and other layers with and without the proposed compensation scheme. The width of each block indicates the aggregation coefficient on that client. Without compensation, layer-level misalignment occurs between self-attention and other layers, while modality-level misalignment occurs between the updates of each modality on multi-modal clients' updates. With compensation, both misalignments are fixed.
Figure 4: Relative performance of each multi-modal method compared to FedAvg under different domain gaps
Figure 5: Performance of different collaborative aggregation strategies
...and 7 more figures

Theorems & Definitions (5)

Remark
Remark
Remark
Theorem
Remark
