Overcome Modal Bias in Multi-modal Federated Learning via Balanced Modality Selection
Yunfeng Fan, Wenchao Xu, Haozhao Wang, Fushuo Huo, Jinyu Chen, Song Guo
TL;DR
This work tackles modal bias in multi-modal federated learning (MFL) caused by uneven modality distributions across clients. It introduces BMSFed, a framework that combines a modal enhancement (ME) loss, built on aggregated global prototypes, with a balanced modality selection mechanism that decouples gradient contributions through two submodular objectives over multi-modal and uni-modal clients. Experiments on CREMA-D, AVE, CG-MNIST, and ModelNet40 under IID and non-IID settings show that BMSFed outperforms the baselines, improves the weak modality, and mitigates global modality bias without extra communication or computation. The approach is robust to modality incongruity and scales across data distributions, which makes it a practical option for real-world multi-modal FL deployments.
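To make the idea of submodular selection concrete, here is a minimal sketch of greedy subset selection under a facility-location objective. It is an illustration only: the facility-location function, the `similarity` matrix, and the names `facility_location` and `greedy_select` are assumptions chosen for this example and are not the two objectives defined by BMSFed.

```python
from typing import List, Sequence


def facility_location(similarity: Sequence[Sequence[float]],
                      selected: List[int]) -> float:
    """F(S) = sum_i max_{j in S} similarity[i][j], with F(empty set) = 0.

    For non-negative similarities this function is monotone submodular.
    """
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in similarity)


def greedy_select(similarity: Sequence[Sequence[float]],
                  budget: int) -> List[int]:
    """Greedily pick up to `budget` indices, each maximizing the marginal gain."""
    n = len(similarity)
    selected: List[int] = []
    remaining = set(range(n))
    for _ in range(min(budget, n)):
        current = facility_location(similarity, selected)
        best, best_gain = None, float("-inf")
        # Iterate in sorted order so that ties resolve to the smallest index.
        for j in sorted(remaining):
            gain = facility_location(similarity, selected + [j]) - current
            if gain > best_gain:
                best, best_gain = j, gain
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical similarities between five local updates:
# indices 0-2 form one group, indices 3-4 form another.
similarity = [
    [1.0, 0.8, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.6, 0.0, 0.0],
    [0.8, 0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 0.5, 1.0],
]
chosen = greedy_select(similarity, budget=2)
```

With a budget of two, the greedy rule takes one representative from each group instead of two near-duplicates from the larger group, which is the diversity effect that submodular selection is generally used for.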
Abstract
Selecting proper clients to participate in each federated learning (FL) round is critical to effectively harness a broad range of distributed data. Existing client selection methods consider only the mining of distributed uni-modal data, yet their effectiveness may diminish in multi-modal FL (MFL), where the modality imbalance problem not only impedes collaborative local training but also leads to a severe global modality-level bias. We empirically reveal that local training with a certain single modality may contribute more to the global model than training with all local modalities. To effectively exploit the distributed multiple modalities, we propose a novel Balanced Modality Selection framework for MFL (BMSFed) to overcome the modal bias. On the one hand, we introduce a modal enhancement loss during local training that alleviates local imbalance based on the aggregated global prototypes. On the other hand, we propose a modality selection scheme that selects subsets of local modalities with high diversity while simultaneously achieving global modal balance. Our extensive experiments on audio-visual, colored-gray, and front-back datasets showcase the superiority of BMSFed over baselines and its effectiveness in multi-modal data exploitation.
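The abstract does not spell out the modal enhancement loss, so the following is only a rough sketch of how a prototype-based term could look. It assumes that global prototypes are sample-count-weighted averages of per-client class means and that the loss is a cross-entropy over negative squared distances to those prototypes. Both choices, and the names `aggregate_prototypes` and `prototype_loss`, are assumptions made for illustration and not the paper's definitions.

```python
import math
from typing import Dict, List


def aggregate_prototypes(client_protos: List[Dict[int, List[float]]],
                         client_counts: List[Dict[int, int]]
                         ) -> Dict[int, List[float]]:
    """Average per-class prototypes across clients, weighted by sample count."""
    sums: Dict[int, List[float]] = {}
    totals: Dict[int, int] = {}
    for protos, counts in zip(client_protos, client_counts):
        for cls, vec in protos.items():
            n = counts[cls]
            acc = sums.setdefault(cls, [0.0] * len(vec))
            for d, v in enumerate(vec):
                acc[d] += n * v
            totals[cls] = totals.get(cls, 0) + n
    return {cls: [s / totals[cls] for s in acc] for cls, acc in sums.items()}


def prototype_loss(feature: List[float], label: int,
                   prototypes: Dict[int, List[float]]) -> float:
    """Cross-entropy over a softmax of negative squared distances to prototypes."""
    classes = sorted(prototypes)
    logits = [-sum((f - p) ** 2 for f, p in zip(feature, prototypes[c]))
              for c in classes]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[classes.index(label)]


# Hypothetical uni-modal features: two clients, two classes, 2-D embeddings.
client_protos = [{0: [0.0, 0.0], 1: [2.0, 2.0]}, {0: [2.0, 0.0]}]
client_counts = [{0: 1, 1: 4}, {0: 3}]
global_protos = aggregate_prototypes(client_protos, client_counts)

loss_correct = prototype_loss([1.5, 0.0], 0, global_protos)
loss_wrong = prototype_loss([1.5, 0.0], 1, global_protos)
```

In this sketch a feature that sits on its own class prototype incurs a small loss, while the same feature scored against the other class is penalized by the squared distance between the two prototypes. A term of this kind pulls the features of a weak modality toward class centers shared across clients.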
