Balanced Multi-modal Federated Learning via Cross-Modal Infiltration
Yunfeng Fan, Wenchao Xu, Haozhao Wang, Jiaqi Zhu, Song Guo
TL;DR
This work tackles modality imbalance in multimodal federated learning by introducing FedCMI, a cross-modal infiltration framework that transfers knowledge from the global dominant modality to weaker modalities while preserving local modality exploitation. A two-projector architecture (Self-Projector and Infiltration Projector) paired with a shared classifier enables coexistence of modality-specific and cross-modal knowledge, and a proximal term mitigates input heterogeneity in FL. The authors further introduce class-wise temperature adaptation to reduce class-level bias during distillation, improving per-class performance across clients. Empirical results on CREMA-D, AVE, and CrisisMMD show significant gains over unimodal FL baselines and existing MFL approaches under both statistical and modality-heterogeneity settings, demonstrating the framework’s effectiveness and scalability in real-world multimodal distributed learning.
Abstract
Federated learning (FL) underpins advancements in privacy-preserving distributed computing by collaboratively training neural networks without exposing clients' raw data. Current FL paradigms primarily focus on uni-modal data, while exploiting the knowledge in distributed multimodal data remains largely unexplored. Existing multimodal FL (MFL) solutions are mainly designed for statistical or modality heterogeneity on the input side, but have yet to solve the fundamental issue of "modality imbalance" in distributed conditions, which can lead to inadequate information exploitation and heterogeneous knowledge aggregation across modalities. In this paper, we propose a novel Cross-Modal Infiltration Federated Learning (FedCMI) framework that effectively alleviates modality imbalance and knowledge heterogeneity via knowledge transfer from the global dominant modality. To avoid losing information in the weak modality by merely imitating the behavior of the dominant modality, we design a two-projector module that integrates knowledge from the dominant modality while still promoting local feature exploitation of the weak modality. In addition, we introduce a class-wise temperature adaptation scheme to achieve fair performance across different classes. Extensive experiments on popular datasets confirm the effectiveness of the proposed framework in fully exploring the information of each modality in MFL.
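
As a rough illustration of how a class-wise temperature could enter a distillation loss between a dominant-modality "teacher" and a weak-modality "student", the sketch below looks up one temperature per ground-truth class and applies standard temperature-scaled KL distillation. This is not the FedCMI implementation: the function names (`log_softmax`, `kd_loss`, `classwise_kd_loss`), the per-class lookup table `class_temperatures`, and the `T^2` scaling are assumptions made for illustration, and the rule by which the paper adapts the temperatures is not shown here.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = math.log(sum(math.exp(s - m) for s in scaled))
    return [s - m - log_total for s in scaled]

def kd_loss(teacher_logits, student_logits, temperature):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the common convention for keeping gradient
    magnitudes comparable across temperatures (an assumption here).
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl

def classwise_kd_loss(teacher_batch, student_batch, labels, class_temperatures):
    """Mean distillation loss where each sample uses its class's temperature.

    class_temperatures is a hypothetical list indexed by class id.
    """
    losses = [
        kd_loss(t, s, class_temperatures[y])
        for t, s, y in zip(teacher_batch, student_batch, labels)
    ]
    return sum(losses) / len(losses)

# Toy example: two samples, three classes, one temperature per class.
teacher = [[2.0, 0.5, -1.0], [0.0, 1.5, 0.3]]
student = [[0.1, 0.2, 0.3], [0.0, 1.5, 0.3]]
loss = classwise_kd_loss(teacher, student, labels=[0, 1],
                         class_temperatures=[1.0, 2.0, 4.0])
```

In this toy batch the second sample contributes zero loss because its student logits already match the teacher, so the batch loss is driven entirely by the first sample at the temperature assigned to class 0.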
