VLM-KD: Knowledge Distillation from VLM for Long-Tail Visual Recognition

Zaiwei Zhang, Gregory P. Meyer, Zhichao Lu, Ashish Shrivastava, Avinash Ravichandran, Eric M. Wolff

TL;DR

This work introduces an effective method to distill knowledge from an off-the-shelf vision-language model (VLM), demonstrating that it provides novel supervision in addition to that from a conventional vision-only teacher model.

Abstract

For visual recognition, knowledge distillation typically involves transferring knowledge from a large, well-trained teacher model to a smaller student model. In this paper, we introduce an effective method to distill knowledge from an off-the-shelf vision-language model (VLM), demonstrating that it provides novel supervision in addition to that from a conventional vision-only teacher model. Our key technical contribution is the development of a framework that generates novel text supervision and distills free-form text into a vision encoder. We showcase the effectiveness of our approach, termed VLM-KD, across various benchmark datasets, showing that it surpasses several state-of-the-art long-tail visual classifiers. To our knowledge, this work is the first to utilize knowledge distillation with text supervision generated by an off-the-shelf VLM and apply it to vanilla randomly initialized vision encoders.

Paper Structure (37 sections, 4 equations, 12 figures, 9 tables)

This paper contains 37 sections, 4 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: t-SNE scatter plots of feature embeddings from sampled text generated by a VLM. Instances from long-tail classes are joining feature clusters formed by common categories. Grouped features exhibit strong semantic relevance.
  • Figure 2: Overview of VLM-KD. We first query VLMs with images and prompts to generate text supervision for each image in the dataset. Then we encode the text with a pretrained text encoder. Note that the text feature supervision is pre-computed for the entire dataset before training. Finally, with paired image and text features, we add a contrastive loss to the original classifier training to distill text semantics into the image encoder. The text adaptor is discarded after training.
  • Figure 3: Benefits of adding more text supervision. The gray dashed line indicates using only the captions generated by the General Prompt. +T1234 indicates training with captions generated by the T1, T2, T3, and T4 Targeted Prompts.
  • Figure 4: Feature visualization for embeddings across 10 classes. (Best viewed in color.)
  • Figure 5: Examples of the prompts used and responses.
  • ...and 7 more figures
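
The training objective described in the Figure 2 caption (a contrastive loss on paired image and text features, added to the original classifier loss) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the symmetric InfoNCE form, the temperature of 0.07, the `weight` parameter, and all function names are assumptions. The text features are taken to be pre-computed and already passed through the text adaptor.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def image_text_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch (assumed form of the contrastive loss).

    image_feats[k] and text_feats[k] belong to the same image; every other
    pairing in the batch is treated as a negative.
    """
    img = [l2_normalize(v) for v in image_feats]
    txt = [l2_normalize(v) for v in text_feats]
    n = len(img)
    # Cosine similarities scaled by the temperature (hypothetical value).
    sim = [[dot(i, t) / temperature for t in txt] for i in img]
    image_to_text = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


def total_loss(class_logits, labels, image_feats, text_feats, weight=1.0):
    """Classifier loss plus the weighted contrastive term (weight is assumed)."""
    n = len(labels)
    cls = sum(cross_entropy(class_logits[k], labels[k]) for k in range(n)) / n
    return cls + weight * image_text_contrastive_loss(image_feats, text_feats)
```

In this sketch the text branch only shapes the image features during training; at inference time only the image encoder and classifier would be used, which matches the caption's note that the text adaptor is discarded after training.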