Semantically-Prompted Language Models Improve Visual Descriptions

Michael Ogezi; Bradley Hauer; Grzegorz Kondrak

TL;DR

V-GLOSS improves visual descriptions and achieves strong zero-shot results on general and fine-grained image-classification datasets, including ImageNet, STL-10, FGVC Aircraft, and Flowers 102.

Abstract

Language-vision models like CLIP have made significant strides in vision tasks, such as zero-shot image classification (ZSIC). However, generating specific and expressive visual descriptions remains challenging; descriptions produced by current methods are often ambiguous and lacking in granularity. To tackle these issues, we propose V-GLOSS: Visual Glosses, a novel method built upon two key ideas. The first is Semantic Prompting, which conditions a language model on structured semantic knowledge. The second is a new contrastive algorithm that elicits fine-grained distinctions between similar concepts. With both ideas, we demonstrate that V-GLOSS improves visual descriptions and achieves strong results in the zero-shot setting on general and fine-grained image-classification datasets, including ImageNet, STL-10, FGVC Aircraft, and Flowers 102. Moreover, these descriptive capabilities contribute to enhancing image-generation performance. Finally, we introduce a quality-tested silver dataset with descriptions generated with V-GLOSS for all ImageNet classes.

Paper Structure (48 sections, 1 equation, 7 figures, 7 tables, 1 algorithm)

Introduction
Tasks
Related Work
Language Models
Language-Vision Models
Producing Descriptions & Prompting
Method
Mapping Classes to Synsets
V-GLOSS
Normal V-GLOSS
Contrastive V-GLOSS
Evaluation
Datasets
ImageNet
CIFAR-10
...and 33 more sections

Figures (7)

Figure 1: For the dog class, we depict (a) V-GLOSS's architecture (Section \ref{par:normal}), along with adaptations: (b) zero-shot image classification (ZSIC) (Section \ref{par:zsic}) and (c) zero-shot class-conditional image generation (ZSCIG) (Section \ref{par:zscig}).
Figure 2: Class descriptions for Platypus produced by one template-based method (a) and two that use LMs (b and c). Input prompts, output descriptions, and plugged values are shown.
Figure 3: A sample of an SKB hypernym hierarchy. For contrastive prompting, we only distinguish classes that are semantically similar to the target class, like alligator to crocodile.
Figure 4: V-GLOSS accuracy vs. $N$, with the number of normal descriptions fixed at 50.
Figure 5: Attention map for V-GLOSS description
...and 2 more figures
