Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data

Thu Hang Phung, Duong M. Nguyen, Thanh Trung Huynh, Quoc Viet Hung Nguyen, Trong Nghia Hoang, Phi Le Nguyen

TL;DR

The paper tackles the challenge of fine-tuning large pre-trained models in federated settings where clients hold heterogeneous and incomplete multimodal data. It introduces FED-PRIME, a framework that splits tuning prompts into inter-client and intra-client sets, enabling input-level alignment and cross-client aggregation while preserving local modality patterns. A local input-adaptive retrieval mechanism and a server-side clustering-based alignment (with a Hungarian-algorithm solution and a popularity regularizer) enable effective knowledge sharing across diverse missing-data patterns. Empirical results on MM-IMDB and UPMC Food-101 demonstrate state-of-the-art performance across multiple missing-modality scenarios, along with robustness to missing rates and faster convergence relative to baselines. This work advances practical federated multimodal learning by enabling semantically aligned, scalable prompt-tuning across heterogeneous client data without centralizing private multimodal datasets.
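The server-side alignment step mentioned above can be pictured as an assignment problem: before averaging, each client prompt is matched to the reference prompt it most resembles. The following is a minimal sketch of that idea only, not the paper's method. It brute-forces the assignment over all permutations, solving the same matching problem the Hungarian algorithm solves efficiently, and it omits the clustering and the popularity regularizer. The function names, the squared-distance cost, and the toy two-dimensional "prompts" are all assumptions made for illustration.

```python
from itertools import permutations


def sq_dist(u, v):
    """Squared Euclidean distance between two prompt vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def align_prompts(reference, client):
    """Return perm such that client[perm[i]] is matched to reference[i].

    Brute-force search over all one-to-one matchings (hypothetical helper);
    only practical for very small prompt sets.
    """
    n = len(reference)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(sq_dist(reference[i], client[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm


def aggregate(reference, client):
    """Average each reference prompt with its matched client prompt."""
    perm = align_prompts(reference, client)
    return [
        [(a + b) / 2 for a, b in zip(reference[i], client[perm[i]])]
        for i in range(len(reference))
    ]


# Toy data (assumed): client_b holds roughly the same prompts as client_a,
# but stored in a different order.
client_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
client_b = [[0.1, 0.9], [0.9, 1.1], [1.1, 0.1]]

perm = align_prompts(client_a, client_b)
merged = aggregate(client_a, client_b)
print(perm)
print(merged)
```

In this toy case, averaging by index without alignment would blend `[1.0, 0.0]` with `[0.1, 0.9]`, two prompts that encode different patterns, whereas the matched average pairs it with `[1.1, 0.1]`.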

Abstract

This paper introduces a generalized federated prompt-tuning framework for practical scenarios where local datasets are multi-modal and exhibit different distributional patterns of missing features at the input level. The proposed framework bridges the gap between federated learning and multi-modal prompt-tuning, which have traditionally focused on either uni-modal or centralized data. A key challenge in this setting arises from the lack of semantic alignment between prompt instructions that encode similar distributional patterns of missing data across different clients. To address this, our framework introduces specialized client-tuning and server-aggregation designs that simultaneously optimize, align, and aggregate prompt-tuning instructions across clients and data modalities. This allows prompt instructions to complement one another and be combined effectively. Extensive evaluations on diverse multimodal benchmark datasets demonstrate that our work consistently outperforms state-of-the-art (SOTA) baselines.

Paper Structure

This paper contains 21 sections, 10 equations, 10 figures, 4 tables, and 3 algorithms.

Figures (10)

  • Figure 1: Our approach with prompt alignment outperforms FedAvg prompt-tuning (w/o alignment) across settings with different data missing rates.
  • Figure 2: Overview of the proposed multi-modal federated prompt-tuning framework (see Alg. \ref{alg:FPT}). Each client maintains two (learnable) sets of intra- and inter-client prompts. At the beginning of each iteration, each client performs local training via Eq. \ref{eq:5} and Eq. \ref{eq:6}. Local sets of intra- and inter-client prompts are then sent to the server for aggregation (see Eq. \ref{eq:8} to Eq. \ref{eq:12}).
  • Figure 3: Workflow of the prompt-alignment algorithm. At each iteration, each client samples a subset of summarizing prompts using its query and key functions. The client then performs a local update, resulting in heterogeneous prompt sets, which are subsequently sent to the server to be clustered and summarized into new summarizing prompt sets for the next iteration.
  • Figure 4: Performance comparison on UPMC Food-101 under Miss-Both scenarios with various missing rates.
  • Figure 5: Plots of the train/test loss convergence on Food-101 under Miss-Text. FED-PRIME converges fastest.
  • ...and 5 more figures