Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions

Oindrila Saha, Grant Van Horn, Subhransu Maji

TL;DR

This work tackles the challenge of poor zero-shot performance in fine-grained domains by leveraging LLM-generated category descriptions and large fine-grained image datasets to train vision-language models with bag-level supervision. The authors introduce a dataset-generation pipeline and a category-level fine-tuning objective that stochastically pairs images with text within the same class, combined with a contrastive loss and momentum-augmented encoders. They demonstrate consistent, multi-dataset improvements (approximately 4–5% on average) over strong baselines, and show that geographic and habitat priors complement visual attributes and can be just as effective as appearance alone. A public 14-dataset benchmark is released to support ongoing research in zero-shot recognition, highlighting practical impact for fine-grained classification across diverse domains.

Abstract

The zero-shot performance of existing vision-language models (VLMs) such as CLIP is limited by the availability of large-scale, aligned image and text datasets in specific domains. In this work, we leverage two complementary sources of information -- descriptions of categories generated by large language models (LLMs) and abundant, fine-grained image classification datasets -- to improve the zero-shot classification performance of VLMs across fine-grained domains. On the technical side, we develop methods to train VLMs with this "bag-level" image-text supervision. We find that simply using these attributes at test-time does not improve performance, but our training strategy, for example, on the iNaturalist dataset, leads to an average improvement of 4-5% in zero-shot classification accuracy for novel categories of birds and flowers. Similar improvements are observed in domains where a subset of the categories was used to fine-tune the model. By prompting LLMs in various ways, we generate descriptions that capture visual appearance, habitat, and geographic regions and pair them with existing attributes such as the taxonomic structure of the categories. We systematically evaluate their ability to improve zero-shot categorization in natural domains. Our findings suggest that geographic priors can be just as effective and are complementary to visual appearance. Our method also outperforms prior work on prompt-based tuning of VLMs. We release the benchmark, consisting of 14 datasets, at https://github.com/cvl-umass/AdaptCLIPZS, which will contribute to future research in zero-shot recognition.
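
The "bag-level" supervision mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that each class has a bag of LLM-generated descriptions, that one description is drawn at random for every image in a batch, and that the loss is a CLIP-style image-to-text contrastive loss in which every text from the same class counts as a positive. The function names, the toy embeddings, and the temperature value are hypothetical.

```python
import math
import random


def sample_pairs(image_labels, class_texts, rng):
    """For each image, draw one description at random from its class's bag."""
    return [rng.choice(class_texts[label]) for label in image_labels]


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def bag_contrastive_loss(image_embs, text_embs, labels, temperature=0.07):
    """Image-to-text contrastive loss; all same-class texts are positives (assumption)."""
    image_embs = [normalize(v) for v in image_embs]
    text_embs = [normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [sum(a * b for a, b in zip(img, txt)) / temperature
                  for txt in text_embs]
        # Numerically stable log-sum-exp over all texts in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        positives = [j for j, lab in enumerate(labels) if lab == labels[i]]
        # Average negative log-probability over the positive texts.
        total += -sum(logits[j] - log_z for j in positives) / len(positives)
    return total / len(image_embs)


# Toy usage: two classes, each with a small bag of hypothetical descriptions.
class_texts = {
    "vesper_sparrow": ["streaked brown upperparts", "found in open grasslands"],
    "common_tern": ["black cap and forked tail", "nests near coasts and lakes"],
}
batch_labels = ["vesper_sparrow", "common_tern"]
batch_texts = sample_pairs(batch_labels, class_texts, random.Random(0))
```

In a real pipeline the sampled texts and the images would pass through the text and image encoders of the VLM before the loss is computed; here the embeddings are supplied directly so that the pairing and the loss stay visible.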

Paper Structure (32 sections, 5 equations, 8 figures, 13 tables)

Figures (8)

  • Figure 1: Motivation. Collecting image captions in fine-grained domains requires expertise (top row), but LLMs can generate structured (e.g., shape or appearance) and accurate descriptions of categories at both the coarse (e.g., birds) and fine-grained level (e.g., Vesper Sparrow). Rich descriptions of fine-grained categories can be paired with existing datasets, such as iNaturalist [van2018inaturalist] and NABirds [van2015building], to generate coarsely-aligned image-text datasets for fine-tuning VLMs. This improves their zero-shot performance on a range of benchmarks, generalizing to novel categories and tasks.
  • Figure 2: Visualizing image-text similarity. All images within a category are sorted in order of similarity to a given text, as predicted by CLIP and by our fine-tuned CLIPFT+A. For example, our method identifies birds that show an olive-green tint on their back as the top images, whereas CLIP selects birds with visibly brown upperparts or an occluded back. The image with the lowest similarity, which has an occluded back, remains the same for both models, showing that our model does not learn incorrect attribute associations even though we stochastically pair every attribute with every image during training. In the aircraft example, our model predicts higher similarity for images with a prominently visible fuselage. CLIP identifies the least similar image as one in which the fuselage is visible, but ours chooses one where the aircraft are too far away to make out the shape of the fuselage.
  • Figure 3: Fine-tuning VLMs to improve zero-shot performance. a) Our framework for ① generating fine-grained attributes per class using LLMs, ② category-level fine-tuning of VLMs and ③ evaluating on a series of challenging unseen scenarios. b) We show examples of texts produced in step ①.
  • Figure 4: Example texts produced for the category "White Spruce" of the iNaturalist dataset using GPT4-0613; example images are shown on the right.
  • Figure 5: Example texts produced for the category "Common Tern" of the CUB dataset using GPT4-0613; example images are shown on the right.
  • ...and 3 more figures
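
The ranking described in the Figure 2 caption amounts to sorting the images of a category by the similarity between each image embedding and one text embedding. Below is a minimal sketch, assuming the embeddings have already been produced by some image and text encoder and that cosine similarity is the score. The image names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images_by_text(image_embs, text_emb):
    """Return image names sorted from most to least similar to the text."""
    scores = {name: cosine(emb, text_emb) for name, emb in image_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings standing in for encoder outputs (hypothetical values).
image_embs = {
    "img_a": [0.9, 0.1],
    "img_b": [0.2, 0.8],
    "img_c": [0.6, 0.4],
}
text_emb = [1.0, 0.0]  # e.g. the embedding of one attribute description
ranking = rank_images_by_text(image_embs, text_emb)
```

Running the same ranking with embeddings from two different models and comparing the top and bottom images is one way to reproduce the kind of side-by-side comparison the caption describes.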