Open-Vocabulary Federated Learning with Multimodal Prototyping

Huimin Zeng, Zhenrui Yue, Dong Wang

TL;DR

This work tackles open-vocabulary federated learning by enabling predictions for queries involving unseen classes without expanding centralized data sharing. It introduces Fed-MP, a CLIP-based FL framework with two key components: adaptive aggregation via client residuals that semantically weight client contributions, and multimodal prototyping that combines text prototypes with visual prototype centroids for robust open-vocabulary inference. The method achieves superior open-vocabulary generalization over baselines across six datasets, with ablations confirming the necessity of both adaptive aggregation and multimodal prototyping. Fed-MP also demonstrates favorable efficiency, scalability, and privacy-preserving characteristics, making it practically impactful for privacy-conscious, real-world FL deployments where novel concepts emerge over time.

Abstract

Existing federated learning (FL) studies usually assume that the training label space and the test label space are identical. In real-world applications, however, this assumption rarely holds. A new user could issue queries that involve data from unseen classes, and such open-vocabulary queries would directly defeat these FL systems. Therefore, in this work, we explicitly focus on the under-explored open-vocabulary challenge in FL. That is, for a new user, the global server should understand their query even when it involves arbitrary unknown classes. To address this problem, we leverage pre-trained vision-language models (VLMs). In particular, we present a novel adaptation framework tailored for VLMs in the context of FL, named Federated Multimodal Prototyping (Fed-MP). Fed-MP adaptively aggregates the local model weights based on lightweight client residuals, and makes predictions based on a novel multimodal prototyping mechanism. Fed-MP exploits the knowledge learned from the seen classes and makes the adapted VLM more robust to unseen categories. Our empirical evaluation on various datasets validates the effectiveness of Fed-MP.
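
To make the multimodal prototyping idea more concrete, the following is a minimal sketch of prototype-based inference, not the authors' implementation. It assumes that one text prototype and one visual prototype centroid are already available for every candidate class, and that the two similarity scores are simply blended. The function name `multimodal_predict`, the mixing weight `alpha`, and the `temperature` value are hypothetical choices made for illustration; the paper's actual scoring rule may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_predict(query, text_protos, visual_protos, alpha=0.5, temperature=0.01):
    """Blend text-prototype and visual-prototype similarities (assumed rule).

    query:         (d,) image embedding of the test sample
    text_protos:   (C, d) one text embedding per candidate class
    visual_protos: (C, d) one visual centroid per candidate class
    alpha:         hypothetical mixing weight between the two modalities
    """
    q = l2_normalize(query)
    text_logits = q @ l2_normalize(text_protos).T      # cosine similarity to text prototypes
    visual_logits = q @ l2_normalize(visual_protos).T  # cosine similarity to visual centroids
    logits = (alpha * text_logits + (1.0 - alpha) * visual_logits) / temperature
    return softmax(logits)

# Toy example: three candidate classes in a 4-dimensional embedding space.
text_protos = np.eye(3, 4)
visual_protos = np.eye(3, 4) + 0.1
query = np.array([0.1, 0.0, 1.0, 0.0])  # closest to class index 2
probs = multimodal_predict(query, text_protos, visual_protos)
predicted_class = int(np.argmax(probs))
```

In this sketch the candidate classes can include categories never seen during federated training, because both prototype sets are built per query rather than fixed at training time.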

Paper Structure

This paper contains 29 sections, 11 equations, 5 figures, 3 tables, and 2 algorithms.

Figures (5)

  • Figure 1: A non-open-vocabulary FL model can only return a prediction from the seen classes, even for an open-vocabulary query.
  • Figure 2: The training and aggregation process of Fed-MP. On clients, the adapters and residuals are trained using local data. In adaptive aggregation, only the adapter weights are aggregated.
  • Figure 3: t-SNE visualization of test classes from Caltech101.
  • Figure 4: Robustness study w.r.t. number of training samples.
  • Figure 5: Scalability study w.r.t. number of clients.
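
The adaptive aggregation step described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the paper's algorithm: each client is assumed to contribute a dictionary of adapter weights plus a residual vector, and the server is assumed to score each client by the cosine similarity between its residual and an embedding of the incoming query. The function name `aggregate_adapters`, the cosine scoring, and the softmax `temperature` are all hypothetical. Only the adapter weights are averaged, matching the caption.

```python
import numpy as np

def aggregate_adapters(adapters, residuals, query, temperature=1.0):
    """Weighted average of client adapter weights (assumed scoring rule).

    adapters:  list of dicts mapping parameter name -> ndarray, one per client
    residuals: list of (d,) residual vectors, one per client
    query:     (d,) embedding of the new user's query
    Returns the merged adapter dict and the per-client weights.
    """
    r = np.stack(residuals)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = (r @ q) / temperature       # hypothetical relevance score per client
    scores = scores - scores.max()       # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()    # client weights sum to 1
    merged = {
        name: sum(w * client[name] for w, client in zip(weights, adapters))
        for name in adapters[0]
    }
    return merged, weights

# Toy example with two clients; the query is aligned with client 0's residual.
adapters = [{"W": np.ones((2, 2))}, {"W": np.zeros((2, 2))}]
residuals = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
query = np.array([1.0, 0.0])
merged, weights = aggregate_adapters(adapters, residuals, query)
```

In the toy example, client 0 receives the larger weight because its residual points in the same direction as the query, so the merged adapter sits closer to client 0's weights.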