Semantically-Prompted Language Models Improve Visual Descriptions

Michael Ogezi, Bradley Hauer, Grzegorz Kondrak

TL;DR

V-GLOSS improves visual descriptions and achieves strong zero-shot results on general and fine-grained image-classification datasets, including ImageNet, STL-10, FGVC Aircraft, and Flowers 102.

Abstract

Language-vision models like CLIP have made significant strides in vision tasks, such as zero-shot image classification (ZSIC). However, generating specific and expressive visual descriptions remains challenging; descriptions produced by current methods are often ambiguous and lacking in granularity. To tackle these issues, we propose V-GLOSS: Visual Glosses, a novel method built upon two key ideas. The first is Semantic Prompting, which conditions a language model on structured semantic knowledge. The second is a new contrastive algorithm that elicits fine-grained distinctions between similar concepts. With both ideas, we demonstrate that V-GLOSS improves visual descriptions and achieves strong results in the zero-shot setting on general and fine-grained image-classification datasets, including ImageNet, STL-10, FGVC Aircraft, and Flowers 102. Moreover, these descriptive capabilities contribute to enhancing image-generation performance. Finally, we introduce a quality-tested silver dataset with descriptions generated with V-GLOSS for all ImageNet classes.
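
To make the zero-shot setting concrete, the following is a minimal sketch of how class descriptions can drive classification in a CLIP-style shared embedding space. It is not the authors' implementation: the toy 3-dimensional vectors stand in for real image and text embeddings, the names `cosine` and `classify` are hypothetical, and averaging the similarity over a class's descriptions is an assumption made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(image_emb, class_desc_embs):
    """Pick the class whose descriptions are, on average, most similar to the image.

    class_desc_embs maps a class name to a list of description embeddings.
    Averaging over descriptions is an illustrative assumption.
    """
    scores = {
        name: sum(cosine(image_emb, e) for e in embs) / len(embs)
        for name, embs in class_desc_embs.items()
    }
    return max(scores, key=scores.get), scores

# Toy embeddings (hypothetical values, not produced by any real model).
image_emb = [0.9, 0.1, 0.2]
class_desc_embs = {
    "alligator": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.1]],
    "crocodile": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]],
}

label, scores = classify(image_emb, class_desc_embs)
print(label)
```

In a real pipeline the embeddings would come from the image and text encoders of a language-vision model, and more specific descriptions of similar classes would be expected to separate their scores more clearly.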

Paper Structure

This paper contains 48 sections, 1 equation, 7 figures, 7 tables, and 1 algorithm.

Figures (7)

  • Figure 1: For the dog class, we depict (a) V-GLOSS's architecture (Section \ref{par:normal}), along with adaptations: (b) zero-shot image classification (ZSIC) (Section \ref{par:zsic}) and (c) zero-shot class-conditional image generation (ZSCIG) (Section \ref{par:zscig}).
  • Figure 2: Class descriptions for Platypus produced by one template-based method (a) and two that use LMs (b and c). Input prompts, output descriptions, and plugged values are shown.
  • Figure 3: A sample of an SKB hypernym hierarchy. For contrastive prompting, we only distinguish classes that are semantically similar to the target class, like alligator to crocodile.
  • Figure 4: V-GLOSS Accuracy vs $N$, with the number of normal fixed at 50.
  • Figure 5: Attention map for V-GLOSS description
  • ...and 2 more figures