Balanced Multi-modal Federated Learning via Cross-Modal Infiltration

Yunfeng Fan; Wenchao Xu; Haozhao Wang; Jiaqi Zhu; Song Guo

TL;DR

This work tackles modality imbalance in multimodal federated learning by introducing FedCMI, a cross-modal infiltration framework that transfers knowledge from the global dominant modality to weaker modalities while preserving local modality exploitation. A two-projector architecture (Self-Projector and Infiltration Projector) paired with a shared classifier enables coexistence of modality-specific and cross-modal knowledge, and a proximal term mitigates input heterogeneity in FL. The authors further introduce class-wise temperature adaptation to reduce class-level bias during distillation, improving per-class performance across clients. Empirical results on CREMA-D, AVE, and CrisisMMD show significant gains over unimodal FL baselines and existing MFL approaches under both statistical and modality-heterogeneity settings, demonstrating the framework’s effectiveness and scalability in real-world multimodal distributed learning.

Abstract

Federated learning (FL) underpins advances in privacy-preserving distributed computing by collaboratively training neural networks without exposing clients' raw data. Current FL paradigms primarily focus on unimodal data, while exploiting the knowledge in distributed multimodal data remains largely unexplored. Existing multimodal FL (MFL) solutions are mainly designed for statistical or modality heterogeneity on the input side; they have yet to solve the fundamental issue of "modality imbalance" in distributed conditions, which can lead to inadequate information exploitation and heterogeneous knowledge aggregation across modalities. In this paper, we propose a novel Cross-Modal Infiltration Federated Learning (FedCMI) framework that effectively alleviates modality imbalance and knowledge heterogeneity via knowledge transfer from the global dominant modality. To avoid losing information in the weak modality by merely imitating the behavior of the dominant modality, we design a two-projector module that integrates knowledge from the dominant modality while still promoting local feature exploitation of the weak modality. In addition, we introduce a class-wise temperature adaptation scheme to achieve fair performance across different classes. Extensive experiments on popular datasets confirm that the proposed framework fully explores the information of each modality in MFL.
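
To make the idea of class-wise temperature in distillation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a standard temperature-scaled KL distillation loss between dominant-modality (teacher) logits and weak-modality (student) logits, with the temperature looked up per class. The function names and the `class_temps` table are hypothetical, and how the paper actually adapts the temperatures is not specified in this abstract.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, label, class_temps):
    """KL(teacher || student) using a temperature chosen by the sample's class.

    class_temps is a hypothetical mapping {class_index: temperature}; the
    rule that would update it during training is an assumption left out here.
    """
    t = class_temps[label]
    p = softmax(teacher_logits, t)  # dominant-modality (teacher) distribution
    q = softmax(student_logits, t)  # weak-modality (student) distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (t * t) * kl  # T^2 factor, as is conventional in distillation


if __name__ == "__main__":
    temps = {0: 1.0, 1: 4.0}  # illustrative values only
    teacher = [3.0, 0.5, 0.1]
    student = [0.2, 2.5, 0.3]
    print(kd_loss(teacher, student, 0, temps))
    print(kd_loss(teacher, student, 1, temps))
```

A higher temperature flattens both distributions, so a class with a larger entry in `class_temps` receives a softer distillation target than one with a smaller entry.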

Paper Structure (15 sections, 10 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Related Work
  Unimodal Federated Learning
  Multimodal Federated Learning
  Imbalanced Multimodal Learning
Method
  Problem Formulation
  Cross-modal Infiltration
  Class-wise Temperature Adaptation
Experiments
  Datasets and baselines
  Implementation Details
  Main Results
  Ablation studies
Conclusion

Figures (5)

Figure 1: Per-class performance of each modality on CREMA-D with the vanilla local training strategy. Modality imbalance behaves differently across clients with different modalities and diverse data distributions. Clients 1 and 3 possess both audio and visual data with different distributions; client 2 contains only visual data with the same distribution as client 1.
Figure 2: Overall workflow of the proposed framework. Multimodal clients exploit the information from each modality via ground-truth supervision and also absorb knowledge from the global dominant modality to alleviate heterogeneous modality inhibition. Unimodal clients use only local data to train the corresponding modules. All updated modules except the infiltration projector participate in server-client communication.
Figure 3: Unimodal performance in MFL on the CREMA-D dataset in case A. The visual modality is severely inhibited in the baselines. Our method improves not only the performance of the visual modality but also the overall performance.
Figure 4: Class-wise performance of the visual modality from the global model on CREMA-D under case A. MFedAvg leads to particularly biased knowledge, whereas FedCMI achieves balanced performance across classes.
Figure 5: Test accuracy versus number of communication rounds for all baselines and our FedCMI on CREMA-D under the "full/IID" setting. FedCMI converges quickly and consistently outperforms strong competitors.
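
The communication pattern described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's algorithm: it assumes FedAvg-style averaging weighted by local sample count, represents each module as a flat list of floats, and uses hypothetical module names. The only behavior taken from the caption is that the infiltration projector stays on the client and that unimodal clients upload only the modules they train.

```python
# Modules that never leave the client (name is hypothetical).
LOCAL_ONLY = {"infiltration_projector"}


def aggregate(client_updates):
    """Average each shared module over the clients that uploaded it.

    client_updates: list of (num_samples, {module_name: [float, ...]}).
    Weighting by num_samples is a FedAvg-style assumption.
    """
    sums, weights = {}, {}
    for num_samples, modules in client_updates:
        for name, params in modules.items():
            if name in LOCAL_ONLY:
                continue  # the infiltration projector is not communicated
            if name not in sums:
                sums[name] = [0.0] * len(params)
                weights[name] = 0
            sums[name] = [s + num_samples * p for s, p in zip(sums[name], params)]
            weights[name] += num_samples
    return {name: [s / weights[name] for s in vec] for name, vec in sums.items()}


if __name__ == "__main__":
    updates = [
        # A multimodal client trains every module.
        (10, {"audio_encoder": [1.0], "visual_encoder": [2.0],
              "infiltration_projector": [9.0]}),
        # A visual-only client uploads only the visual module.
        (30, {"visual_encoder": [4.0]}),
    ]
    print(aggregate(updates))
```

Averaging each module only over the clients that hold it lets unimodal clients contribute to their own modality without diluting the others.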
