Replication Study: Federated Text-Driven Prompt Generation for Vision-Language Models

Suraj Prasad, Anubha Pant

TL;DR

The paper tackles the challenge of generalizing vision-language models in federated settings, where client data is non-IID and must remain private. It examines FedTPG, a text-driven approach in which a PromptTranslator network generates prompts conditioned on class-name embeddings and is trained with FedAvg. The replication reproduces results within 0.2% of the original across six diverse datasets, with average accuracy on unseen classes exceeding that on seen classes by 1.43 percentage points. This supports the claims that text-driven prompts improve cross-class generalization and that federated training preserves strong performance across domains. The work underscores the robustness, efficiency, and privacy-preserving potential of FedTPG for adapting large pretrained vision-language models to distributed, real-world scenarios.

Abstract

Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities, yet adapting them to federated learning scenarios presents significant challenges, particularly regarding generalization to unseen classes. The original FedTPG paper (Qiu et al., 2024) addresses this limitation by introducing a text-driven prompt generation network that dynamically creates prompts conditioned on class names, enabling better cross-class generalization in federated settings. In this work, we present a faithful replication study of FedTPG, evaluating the pre-trained model on six diverse vision datasets: Caltech101, Oxford Flowers, FGVC Aircraft, Oxford Pets, Food-101, and DTD. Our evaluation achieves results within 0.2% of the original paper's reported accuracies, with an average accuracy of 74.58% on seen (base) classes and 76.00% on unseen (new) classes, a +1.43 percentage point advantage for unseen classes. These results validate the original paper's core claims: (1) text-driven prompt generation enables superior generalization to unseen classes compared to static prompt learning methods, and (2) federated training of prompt generators maintains high performance across diverse visual domains without sharing private data. Our successful replication confirms the robustness and reproducibility of the FedTPG approach.
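
To make the federated training step concrete, the following is a minimal sketch of FedAvg-style aggregation, in which a server averages the prompt generator parameters returned by each client, weighted by local sample count. It is illustrative only: the function name `fedavg`, the dictionary-of-lists parameter layout, and the sample-count weighting are assumptions for this example, not the implementation used in the original FedTPG code.

```python
from typing import Dict, List

# Hypothetical parameter layout: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def fedavg(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Average client parameters, weighting each client by its sample count.

    Assumption: every client holds the same parameter names and shapes,
    as it would when all clients train copies of one prompt generator.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Params = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        averaged[name] = [
            # Weighted sum over clients for coordinate i, then normalise.
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Two clients; the second holds three times as much data as the first.
global_params = fedavg(
    [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}],
    client_sizes=[1, 3],
)
print(global_params)  # {'w': [2.5, 3.5]}
```

Only the prompt generator parameters would be exchanged in such a scheme; the raw images stay on each client, which is the privacy property the abstract refers to.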

Paper Structure

This paper contains 28 sections, 2 figures, 3 tables, and 1 algorithm.

Figures (2)

  • Figure 1: Error rates by dataset and split. Most datasets show similar error rates for base (blue) and new (orange) classes, with Aircraft being the most challenging.
  • Figure 2: Left: Performance comparison of base vs. new classes. Right: Generalization gap showing which datasets benefit from positive transfer (green) vs. negative transfer (red).