Resource-Efficient Federated Multimodal Learning via Layer-wise and Progressive Training

Ye Lin Tun, Chu Myaet Thwal, Minh N. H. Nguyen, Choong Seon Hong

TL;DR

LW-FedMML is a layer-wise federated multimodal learning (FedMML) approach that decomposes the training process into multiple stages. Each stage trains only a portion of the model, which significantly reduces memory and computational requirements.

Abstract

Combining different data modalities enables deep neural networks to tackle complex tasks more effectively, making multimodal learning increasingly popular. To harness multimodal data closer to end users, it is essential to integrate multimodal learning with privacy-preserving approaches like federated learning (FL). However, compared to conventional unimodal learning, the multimodal setting requires a dedicated encoder for each modality, resulting in larger and more complex models. Training these models requires significant resources, presenting a substantial challenge for FL clients operating with limited computation and communication resources. To address these challenges, we introduce LW-FedMML, a layer-wise federated multimodal learning approach that decomposes the training process into multiple stages. Each stage focuses on training only a portion of the model, thereby significantly reducing the memory and computational requirements. Moreover, FL clients only need to exchange the trained model portion with the central server, lowering the resulting communication cost. We conduct extensive experiments across various FL and multimodal learning settings to validate the effectiveness of our proposed method. The results demonstrate that LW-FedMML can compete with conventional end-to-end federated multimodal learning (FedMML) while significantly reducing the resource burden on FL clients. Specifically, LW-FedMML reduces memory usage by up to $2.7\times$, computational operations (FLOPs) by $2.4\times$, and total communication cost by $2.3\times$. We also explore a progressive training approach called Prog-FedMML. While it is less resource-efficient than LW-FedMML, Prog-FedMML has the potential to surpass the performance of end-to-end FedMML, making it a viable option for scenarios with fewer resource constraints.
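
The stage-wise idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Block` class, the `start_stage` and `upload_size` helpers, and the parameter counts are all hypothetical, chosen only to show that a client in a given stage trains and uploads just the newly attached portion while earlier portions stay frozen.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One trainable unit of an encoder (hypothetical toy stand-in)."""
    name: str
    num_params: int
    trainable: bool = False


def start_stage(model, new_block):
    """Freeze every block already in the model, then attach a new active block."""
    for block in model:
        block.trainable = False
    new_block.trainable = True
    model.append(new_block)
    return model


def upload_size(model):
    """Parameters a client sends back: only the currently trainable portion."""
    return sum(b.num_params for b in model if b.trainable)


# Illustrative parameter counts, not taken from the paper.
blocks = [Block("block1", 100), Block("block2", 100), Block("block3", 100)]

model = []
per_stage_upload = []
for block in blocks:
    start_stage(model, block)
    per_stage_upload.append(upload_size(model))

# Size of the whole model, which an end-to-end client would exchange each round.
full_model_size = sum(b.num_params for b in model)
```

In this toy setup, every communication round within a stage exchanges one block rather than the full model. The actual savings depend on the encoder sizes and the number of rounds per stage, which this sketch does not model.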

Paper Structure (27 sections, 4 equations, 11 figures, 13 tables, 4 algorithms)


Figures (11)

  • Figure 1: Overview of the federated learning process. (i) The global model $M$ is distributed to clients. (ii) Each client $n$ performs training on the local dataset $D^n$. (iii) The local model $M^n$ is sent back to the server. (iv) The server performs the aggregation.
  • Figure 2: Common multimodal learning approaches given modalities $a$ and $b$. $M$ denotes the model, and $x$ denotes the input sample. (a) The instance discrimination-based approach focuses on aligning the representations of diverse modalities within a given data sample. Negative samples are omitted in the figure for clarity. (b) The supervised approach aims to fuse information from various modalities, enabling the model to make more informed predictions. Here, $H_\text{sup}$ can be any task-specific prediction head.
  • Figure 3: Overview of LW-FedMML for supervised setting at stage $s$.
  • Figure 4: Overview of LW-FedMML for the instance discrimination-based setting. The training process is divided into multiple stages $s \in [1,S]$. In stage $s$, the active layers $L_a^s$ and $L_b^s$ are depicted in green within encoders $F_a^s$ and $F_b^s$, where $a$ and $b$ represent different modalities. Prior layers within the encoders are frozen, as indicated in gray.
  • Figure 5: For training on the COCO dataset, we attach two transformer blocks to the ViT-Tiny encoder and one transformer block to the DistilBERT encoder at each stage. Similarly, for the ADVANCE dataset, we attach two transformer blocks to the ViT-Tiny encoder and one transformer block to the DistilAST encoder at each stage.
  • ...and 6 more figures
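
The four steps in the Figure 1 caption can be sketched as a single federated round. This is a toy illustration under stated assumptions: the caption does not say how the server aggregates, so dataset-size-weighted averaging (FedAvg-style) is assumed here, and `local_train`, `aggregate`, and `fl_round` are hypothetical names. The "model" is a plain list of floats and "training" is one step toward the local data mean.

```python
def local_train(weights, dataset, lr=0.5):
    """Step (ii): toy local update that moves each weight toward the local data mean."""
    target = sum(dataset) / len(dataset)
    return [w - lr * (w - target) for w in weights]


def aggregate(local_models, sizes):
    """Step (iv): average local models, weighted by local dataset size (assumed rule)."""
    total = sum(sizes)
    dim = len(local_models[0])
    return [
        sum(model[i] * n for model, n in zip(local_models, sizes)) / total
        for i in range(dim)
    ]


def fl_round(global_weights, client_datasets):
    """One round: distribute M, train each M^n on D^n, collect, aggregate."""
    # Step (i): each client receives its own copy of the global model.
    local_models = [local_train(list(global_weights), d) for d in client_datasets]
    # Step (iii): local models and dataset sizes are sent back to the server.
    sizes = [len(d) for d in client_datasets]
    return aggregate(local_models, sizes)


# Two hypothetical clients holding scalar datasets of different sizes.
new_global = fl_round([0.0], [[1.0, 1.0], [4.0]])
```

The raw datasets never leave the clients in this sketch; only model weights and dataset sizes reach the server.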