
Data-Free Generalized Zero-Shot Learning

Bowen Tang, Long Yan, Jing Zhang, Qian Yu, Lu Sheng, Dong Xu

TL;DR

This work tackles data-free zero-shot learning (DFZSL), addressing privacy and copyright constraints by learning to recognize unseen classes without access to real base data. It proposes a three-stage CLIP-based framework: (1) recover base-class image features by modeling the base classifier’s CLIP feature distribution with a von Mises-Fisher (vMF) distribution, (2) align these features with textual CLIP representations via Feature-Language Prompt Tuning (FLPT), and (3) train a conditional generator to synthesize new-class features for supervised learning. The approach yields strong improvements on generalized ZSL and base-to-new generalization benchmarks, outperforming prior data-free methods and approaching data-augmented performance in many settings. By leveraging vision-language priors and a data-free feature synthesis pipeline, the method preserves privacy while enabling practical zero-shot transfer, with code available at https://github.com/ylong4/DFZSL.

Abstract

Deep learning models can extract rich knowledge from large-scale datasets. However, sharing data has become increasingly difficult because of concerns about data copyright and privacy, which hampers the effective transfer of knowledge from existing data to novel downstream tasks and concepts. Zero-shot learning (ZSL) approaches aim to recognize new classes by transferring semantic knowledge learned from base classes. However, traditional generative ZSL methods often require access to real images from base classes and rely on manually annotated attributes, which raises problems of data restriction and model scalability. To this end, this paper tackles a challenging and practical problem dubbed data-free zero-shot learning (DFZSL), where only a CLIP-based classifier pre-trained on the base-class data is available for zero-shot classification. Specifically, we propose a generic framework for DFZSL, which consists of three main components. Firstly, to recover virtual features of the base data, we model the CLIP features of base-class images as samples from a von Mises-Fisher (vMF) distribution based on the pre-trained classifier. Secondly, we leverage the text features of CLIP as low-cost semantic information and propose a feature-language prompt tuning (FLPT) method to further align the virtual image features and textual features. Thirdly, we train a conditional generative model using the well-aligned virtual image features and the corresponding semantic text features, enabling the generation of new-class features and achieving better zero-shot generalization. Our framework has been evaluated on five commonly used benchmarks for generalized ZSL, as well as 11 benchmarks for base-to-new ZSL. The results demonstrate the superiority and effectiveness of our approach. Our code is available at https://github.com/ylong4/DFZSL.
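To make the first component more concrete, the following is a minimal sketch of drawing virtual features from a vMF distribution on the unit sphere. It rests on assumptions that this summary does not confirm: each row of a hypothetical classifier weight matrix `weights` is treated as the mean direction of one base class, a single fixed concentration `kappa` is shared by all classes, and sampling uses the standard rejection sampler of Wood (1994). The function names `sample_vmf` and `recover_virtual_features` are illustrative, and the way the paper actually estimates the distribution parameters may differ.

```python
import numpy as np


def sample_vmf(mu, kappa, n, rng):
    """Draw n unit vectors from vMF(mu, kappa) using Wood's (1994) rejection sampler."""
    mu = mu / np.linalg.norm(mu)
    d = mu.shape[0]

    # Step 1: rejection-sample w, the cosine between each sample and mu.
    b = (d - 1) / (2 * kappa + np.sqrt(4 * kappa**2 + (d - 1) ** 2))
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + (d - 1) * np.log(1 - x0**2)
    w = np.empty(n)
    filled = 0
    while filled < n:
        m = n - filled
        z = rng.beta((d - 1) / 2, (d - 1) / 2, size=m)
        u = rng.uniform(size=m)
        cand = (1 - (1 + b) * z) / (1 - (1 - b) * z)
        accept = kappa * cand + (d - 1) * np.log(1 - x0 * cand) - c >= np.log(u)
        k = int(accept.sum())
        w[filled:filled + k] = cand[accept]
        filled += k

    # Step 2: draw a uniform unit direction orthogonal to mu.
    v = rng.standard_normal((n, d))
    v -= (v @ mu)[:, None] * mu
    v /= np.linalg.norm(v, axis=1, keepdims=True)

    # Step 3: combine the component along mu with the orthogonal component.
    ortho_scale = np.sqrt(np.clip(1 - w**2, 0.0, None))
    return w[:, None] * mu + ortho_scale[:, None] * v


def recover_virtual_features(weights, kappa, n_per_class, seed=0):
    """Sample virtual features for every class.

    Assumption (not from the paper): row c of `weights` is the mean
    direction of class c, and all classes share the same kappa.
    """
    rng = np.random.default_rng(seed)
    feats, labels = [], []
    for c, w_c in enumerate(weights):
        feats.append(sample_vmf(w_c, kappa, n_per_class, rng))
        labels.append(np.full(n_per_class, c))
    return np.concatenate(feats), np.concatenate(labels)
```

Called as `recover_virtual_features(weights, kappa=50.0, n_per_class=500)`, the sketch returns unit-norm features with one integer label per sample. One reason a vMF model is a natural fit here: a softmax over scaled cosine similarities is exactly the class posterior under vMF class-conditionals with equal concentration and uniform class priors.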

Paper Structure

This paper contains 13 sections, 9 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: The proposed framework is based on vision-language pre-trained models, such as CLIP. (a) Stage 1: Model the distribution of base class image features properly and then sample virtual image features. (b) Stage 2: Align the obtained virtual image features with the extracted text features via FLPT (Feature-Language Prompt Tuning).