Open-Vocabulary Federated Learning with Multimodal Prototyping

Huimin Zeng; Zhenrui Yue; Dong Wang

TL;DR

This work tackles open-vocabulary federated learning: answering queries that involve classes unseen during training, without additional centralized data sharing. It introduces Fed-MP, a CLIP-based FL framework with two key components. The first is adaptive aggregation, which uses client residuals to weight each client's contribution semantically. The second is multimodal prototyping, which combines text prototypes with visual prototype centroids for robust open-vocabulary inference. Fed-MP generalizes to open-vocabulary queries better than the baselines across six datasets, and ablations confirm that both adaptive aggregation and multimodal prototyping are needed. Fed-MP also shows favorable efficiency, scalability, and privacy-preserving characteristics, which makes it a practical fit for privacy-conscious FL deployments where novel concepts emerge over time.
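To make the multimodal prototyping idea concrete, here is a minimal sketch, not the paper's formulation. It assumes that each class has a text prototype (the embedding of its class prompt), that some classes also have a visual prototype (the centroid of a few image embeddings), and that the two similarities are mixed with a hypothetical weight `alpha`. The function names, the mixing rule, and the text-only fallback are illustrative assumptions; plain Python lists stand in for CLIP embeddings.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def multimodal_predict(image_emb, text_protos, visual_protos, alpha=0.5):
    """Score each class by mixing text and visual prototype similarity.

    text_protos:   {class_name: text embedding of the class prompt}
    visual_protos: {class_name: centroid of image embeddings}; may omit classes
    alpha:         hypothetical mixing weight (an assumption, not from the paper)
    """
    scores = {}
    for name, text_proto in text_protos.items():
        s_text = cosine(image_emb, text_proto)
        if name in visual_protos:
            s_vis = cosine(image_emb, visual_protos[name])
            scores[name] = alpha * s_text + (1 - alpha) * s_vis
        else:
            # Assumed fallback: use the text prototype alone.
            scores[name] = s_text
    best = max(scores, key=scores.get)
    return best, scores


# Toy 3-dimensional embeddings standing in for CLIP features.
text_protos = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0], "zebra": [0.0, 0.0, 1.0]}
visual_protos = {"cat": centroid([[0.8, 0.2, 0.0], [1.0, 0.0, 0.0]])}
label, scores = multimodal_predict([0.9, 0.1, 0.0], text_protos, visual_protos)
```

In this toy setup, "zebra" has no visual prototype and is scored from its text prototype alone, which is the kind of case an open-vocabulary query would raise.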

Abstract

Existing federated learning (FL) studies usually assume that the training label space and the test label space are identical. In real-world applications, however, this assumption rarely holds. A new user may issue queries that involve data from unseen classes, and such open-vocabulary queries would directly defeat these FL systems. In this work, we therefore focus explicitly on the under-explored open-vocabulary challenge in FL: for a new user, the global server should understand a query that involves arbitrary unknown classes. To address this problem, we leverage pre-trained vision-language models (VLMs). In particular, we present a novel adaptation framework tailored for VLMs in the context of FL, named Federated Multimodal Prototyping (Fed-MP). Fed-MP adaptively aggregates the local model weights based on lightweight client residuals, and makes predictions based on a novel multimodal prototyping mechanism. Fed-MP exploits the knowledge learned from the seen classes and makes the adapted VLM more robust to unseen categories. Our empirical evaluation on various datasets validates the effectiveness of Fed-MP.
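The adaptive aggregation step can be illustrated with a short sketch under stated assumptions. The abstract says only that local weights are aggregated adaptively based on client residuals. The scoring rule below (cosine similarity between each client's residual vector and a query embedding, passed through a temperature-scaled softmax) is a hypothetical choice for illustration, as are the names `aggregate_adapters`, `query`, and `temperature`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_adapters(adapters, residuals, query, temperature=0.1):
    """Weighted average of client adapter weights.

    adapters:  one flat weight vector per client (only adapters are aggregated)
    residuals: one residual vector per client, kept out of the average
    query:     embedding of the new user's query (assumed to share the
               residuals' dimensionality; an assumption for this sketch)
    """
    sims = [cosine(r, query) for r in residuals]
    weights = softmax(sims, temperature)
    dim = len(adapters[0])
    merged = [sum(w * a[i] for w, a in zip(weights, adapters)) for i in range(dim)]
    return merged, weights


# Two toy clients; the query is closer to the first client's residual.
adapters = [[1.0, 0.0], [0.0, 1.0]]
residuals = [[1.0, 0.0], [0.0, 1.0]]
merged, weights = aggregate_adapters(adapters, residuals, query=[1.0, 0.0])
```

When all residuals are equally similar to the query, the weights become uniform and the rule reduces to a plain average of the adapters, which matches unweighted federated averaging over equally sized clients.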

Paper Structure (29 sections, 11 equations, 5 figures, 3 tables, 2 algorithms)

Introduction
Related Work
  Federated Learning with Domain Generalization
  Federated Learning with Vision-Language Models
Preliminaries
  Federated Learning
  CLIP: Contrastive Language-Image Pre-training
Algorithm
  Parameter-Efficient Adaptation
  Client Residuals
  Adaptive Model Aggregation with Client Residuals
  Multimodal Prototyping
Experiments
  Experimental Setup
    Dataset
...and 14 more sections

Figures (5)

Figure 1: A non-open-vocabulary FL model can only return a prediction from the seen classes for an open-vocabulary query.
Figure 2: The training and aggregation process of Fed-MP. On clients, the adapters and residuals are trained using local data. In adaptive aggregation, only the adapter weights are aggregated.
Figure 3: t-SNE visualization of test classes from Caltech101.
Figure 4: Robustness study w.r.t. number of training samples.
Figure 5: Scalability study w.r.t. number of clients.
