Generate, Transduct, Adapt: Iterative Transduction with VLMs

Oindrila Saha, Logan Lawrence, Grant Van Horn, Subhransu Maji

TL;DR

GTA-CLIP is a novel technique that incorporates supervision from language models for joint transduction in language and vision spaces. It yields average performance improvements of 8.6% and 3.7% over CLIP and transductive CLIP, respectively, in the zero-shot setting.

Abstract

Transductive zero-shot learning with vision-language models leverages image-image similarities within the dataset to achieve better classification accuracy than the inductive setting. However, little work has explored the structure of the language space in this context. We propose GTA-CLIP, a novel technique that incorporates supervision from language models for joint transduction in language and vision spaces. Our approach is iterative and consists of three steps: (i) incrementally exploring the attribute space by querying language models, (ii) an attribute-augmented transductive inference procedure, and (iii) fine-tuning the language and vision encoders based on inferred labels within the dataset. Through experiments with CLIP encoders, we demonstrate that GTA-CLIP yields average performance improvements of 8.6% and 3.7% across 12 datasets and 3 encoders over CLIP and transductive CLIP, respectively, in the zero-shot setting. We also observe similar improvements in a few-shot setting. We present ablation studies that demonstrate the value of each step and visualize how the vision and language spaces evolve over iterations driven by the transductive learning. Code is released at https://github.com/cvl-umass/GTA-CLIP
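The three-step loop described in the abstract can be illustrated with a toy sketch. This is not the released GTA-CLIP implementation: the embeddings are hand-written 2-D vectors rather than CLIP features, `transduct` is a simple nearest-neighbour smoothing used as a stand-in for the paper's transductive inference, `adapt` nudges text embeddings toward pseudo-labelled images as a stand-in for fine-tuning the encoders, and `toy_generate` is a hypothetical placeholder for querying a language model. All function names are assumptions made for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def softmax(scores, temperature=0.1):
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(image, attributes):
    # Score each class by the mean similarity between the image embedding
    # and that class's text embeddings (class prompt plus attributes).
    scores = [sum(cosine(image, t) for t in texts) / len(texts) for texts in attributes]
    return softmax(scores)

def most_confused_pair(probs):
    # Assumed confusion measure: the class pair whose probabilities co-occur most.
    n = len(probs[0])
    pairs = [(a, b) for a in range(n) for b in range(a + 1, n)]
    return max(pairs, key=lambda ab: sum(p[ab[0]] * p[ab[1]] for p in probs))

def transduct(images, probs, k=1, alpha=0.5):
    # Stand-in for transductive inference: blend each image's prediction
    # with the mean prediction of its k most similar images.
    smoothed = []
    for i, img in enumerate(images):
        others = [j for j in range(len(images)) if j != i]
        nearest = sorted(others, key=lambda j: cosine(img, images[j]), reverse=True)[:k]
        mean = [sum(probs[j][c] for j in nearest) / len(nearest) for c in range(len(probs[i]))]
        smoothed.append([(1 - alpha) * p + alpha * m for p, m in zip(probs[i], mean)])
    return smoothed

def adapt(attributes, images, labels, lr=0.5):
    # Stand-in for fine-tuning: move each class's text embeddings toward
    # the mean embedding of the images pseudo-labelled with that class.
    adapted = []
    for cls, texts in enumerate(attributes):
        members = [img for img, y in zip(images, labels) if y == cls]
        if not members:
            adapted.append(texts)
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        adapted.append([[(1 - lr) * t + lr * m for t, m in zip(text, mean)] for text in texts])
    return adapted

def gta_loop(images, attributes, generate_attributes, iterations=2):
    attributes = [list(texts) for texts in attributes]  # do not mutate the caller's lists
    labels = []
    for _ in range(iterations):
        # Generate: add attributes for the currently most confused class pair.
        probs = [class_probs(x, attributes) for x in images]
        a, b = most_confused_pair(probs)
        for cls, new_texts in generate_attributes(a, b).items():
            attributes[cls].extend(new_texts)
        # Transduct: attribute-augmented inference over the whole dataset.
        probs = transduct(images, [class_probs(x, attributes) for x in images])
        labels = [max(range(len(p)), key=p.__getitem__) for p in probs]
        # Adapt: update the text side using the inferred labels.
        attributes = adapt(attributes, images, labels)
    return labels

def toy_generate(a, b):
    # Hypothetical stub: a real system would query a language model here.
    return {a: [[0.9, -0.1]], b: [[-0.1, 0.9]]}

images = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
prompts = [[[1.0, 0.0]], [[0.0, 1.0]]]
print(gta_loop(images, prompts, toy_generate))
```

The sketch only conveys the control flow: each iteration enlarges the per-class attribute set, re-infers labels jointly over all images, and then updates the representation using those labels.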

Paper Structure

This paper contains 34 sections, 5 equations, 7 figures, 11 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Overview of GTA-CLIP. (a) Vision-language models (VLMs) such as CLIP enable zero-shot classification using the similarity between text embeddings of class prompts and images. (b) Transductive CLIP exploits the structure of the entire image dataset to assign images to classes, improving accuracy. (c) Our approach, GTA-CLIP, iteratively (i) induces structure over the classes in language space by generating attributes driven by pairwise confusions, (ii) performs attribute-augmented transductive inference, and (iii) adapts the CLIP encoders using the inferred labels. (d) Across 12 datasets, we improve upon CLIP and transductive CLIP by 9.5% and 4.0%, respectively, using ViT-B/16, and similarly for other encoders. Significant improvements are also reported in the few-shot setting.
  • Figure 2: t-SNE Plots of Class Attributes. For each category, the prototype, the initial set of attributes, and the final set of attributes are shown in green, blue, and red, respectively. Habitat, relative characteristics, and other distinguishing features are often identified through pairwise comparisons, while the initial attributes tend to describe the prominent visual features. These plots were obtained by projecting the CLIP text embeddings of the attributes with t-SNE. Please see the Appendix for detailed figures.
  • Figure 3: Slaty-backed Gull (vs. Western Gull) Annotated t-SNE Plot.
  • Figure 4: Olive-sided Flycatcher (vs. Least Flycatcher) Annotated t-SNE Plot.
  • Figure 5: Western Wood-Pewee (vs. Least Flycatcher) Annotated t-SNE Plot.
  • ...and 2 more figures