FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation

Min Tan, Junchao Ma, Yinfu Feng, Jiajun Ding, Wenwen Pan, Tingting Han, Qian Zheng, Zhenzhong Kuang, Zhou Yu

TL;DR

This work proposes FedAFD, a unified MFL framework that enhances client and server learning and introduces a bi-level adversarial alignment strategy to align local and global representations within and across modalities, mitigating modality and task gaps.

Abstract

Multimodal Federated Learning (MFL) enables clients with heterogeneous data modalities to collaboratively train models without sharing raw data, offering a privacy-preserving framework that leverages complementary cross-modal information. However, existing methods often overlook personalized client performance and struggle with modality/task discrepancies, as well as model heterogeneity. To address these challenges, we propose FedAFD, a unified MFL framework that enhances client and server learning. On the client side, we introduce a bi-level adversarial alignment strategy to align local and global representations within and across modalities, mitigating modality and task gaps. We further design a granularity-aware fusion module to integrate global knowledge into the personalized features adaptively. On the server side, to handle model heterogeneity, we propose a similarity-guided ensemble distillation mechanism that aggregates client representations on shared public data based on feature similarity and distills the fused knowledge into the global model. Extensive experiments conducted under both IID and non-IID settings demonstrate that FedAFD achieves superior performance and efficiency for both the client and the server.
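
The server-side step described above (weighting client representations of public data by feature similarity, then distilling the fused result into the global model) can be illustrated with a small sketch. This is not the paper's formulation: the abstract does not give the similarity measure, the weighting rule, or the distillation loss, so the cosine similarity, the softmax weighting with a `temperature` parameter, and the mean-squared-error loss below are all assumptions, and every function name is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_public_features(client_feats, global_feat, temperature=1.0):
    """Fuse the clients' features for ONE public sample.

    Assumption: each client is weighted by the softmax of its cosine
    similarity to the global feature of the same sample.
    """
    sims = [cosine(f, global_feat) for f in client_feats]
    weights = softmax(sims, temperature)
    dim = len(global_feat)
    fused = [
        sum(w * f[d] for w, f in zip(weights, client_feats))
        for d in range(dim)
    ]
    return fused, weights


def distillation_loss(global_feat, fused_feat):
    """Assumed distillation objective: mean squared error to the fused target."""
    return sum((g - t) ** 2 for g, t in zip(global_feat, fused_feat)) / len(global_feat)


# Two hypothetical clients; the first agrees with the global feature.
client_feats = [[1.0, 0.0], [0.0, 1.0]]
global_feat = [1.0, 0.0]
fused, weights = aggregate_public_features(client_feats, global_feat)
loss = distillation_loss(global_feat, fused)
```

In this toy case the client whose feature matches the global one receives the larger weight, so the fused target stays closer to it. Because only features on shared public data are exchanged, a scheme of this shape does not require clients to share a model architecture, which is the heterogeneity concern the abstract raises.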

Paper Structure (22 sections, 10 equations, 4 figures, 3 tables, 1 algorithm)

Figures (4)

  • Figure 1: The challenges of multimodal federated learning
  • Figure 2: Overview of our proposed FedAFD, comprising three steps: ① The server is trained on a public dataset and extracts global public features. ② With the delivered global representations and encoders, clients train local models on private data using granularity-aware fusion of local and global features, enhanced with bi-level adversarial feature alignment. ③ Clients extract local features of the public dataset and send them to the server. The server performs adaptive aggregation and updates the global model via similarity-guided ensemble distillation.
  • Figure 3: t-SNE analysis of feature discrepancy on public data. Locally trained and FedAFD encoders are compared.
  • Figure 4: t-SNE visualization of different features on CIFAR-100, in terms of both image category and feature type.