Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion

Philipp Allgeuer, Kyra Ahrens, Stefan Wermter

TL;DR

NOVIC addresses unconstrained open vocabulary image classification by training a text-only, autoregressive object decoder that inverts CLIP embeddings and outputs object nouns without any predefined label list. The method builds a large synthetic training corpus from a comprehensive English object noun dictionary, templated prompts, multisets, and LLM-generated captions, and applies substantial noise augmentation to the text embeddings to bridge the gap between text and image spaces. On open vocabulary datasets and standard benchmarks, the decoders achieve prompt-free prediction scores of up to 87.5%, with zero-shot performance that scales with the strength of the underlying CLIP model, and produce diverse, fine-grained predictions in real time. The approach offers a practical, prompt-free alternative to traditional CLIP prompting and related baselines, enabling open vocabulary recognition in dynamic environments and across languages.
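
The effect of noise augmentation on an embedding can be illustrated with a small sketch. It is not the authors' implementation: a random unit vector stands in for a CLIP text embedding, the 768-dimensional size is borrowed from the Figure 5 caption, and the per-component Gaussian noise with standard deviation `sigma` is an assumed, simplified scheme. All function names are hypothetical.

```python
import math
import random


def normalize(vec):
    """Rescale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def random_unit_vector(dim, rng):
    """Stand-in for a CLIP text embedding (assumption: not a real embedding)."""
    return normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])


def add_gaussian_noise(embedding, sigma, rng):
    """Add independent Gaussian noise to each component, then renormalize."""
    noisy = [x + rng.gauss(0.0, sigma) for x in embedding]
    return normalize(noisy)


def angle_deg(a, b):
    """Angle in degrees between two vectors."""
    a, b = normalize(a), normalize(b)
    dot = sum(x * y for x, y in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))


if __name__ == "__main__":
    rng = random.Random(0)
    embedding = random_unit_vector(768, rng)
    for sigma in (0.01, 0.02, 0.05):  # hypothetical noise levels
        noisy = add_gaussian_noise(embedding, sigma, rng)
        print(f"sigma={sigma}: angle={angle_deg(embedding, noisy):.1f} degrees")
```

In high dimensions even a small per-component standard deviation rotates the embedding by a large angle, which is one way to see why noise of this kind could help a text-trained decoder tolerate the offset between text and image embeddings.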

Abstract

We introduce NOVIC, an innovative real-time uNconstrained Open Vocabulary Image Classifier that uses an autoregressive transformer to generatively output classification labels as language. Leveraging the extensive knowledge of CLIP models, NOVIC harnesses the embedding space to enable zero-shot transfer from pure text to images. Traditional CLIP models, despite their ability for open vocabulary classification, require an exhaustive prompt of potential class labels, restricting their application to images of known content or context. To address this, we propose an "object decoder" model that is trained on a large-scale 92M-target dataset of templated object noun sets and LLM-generated captions to always output the object noun in question. This effectively inverts the CLIP text encoder and allows textual object labels from essentially the entire English language to be generated directly from image-derived embedding vectors, without requiring any a priori knowledge of the potential content of an image, and without any label biases. The trained decoders are tested on a mix of manually and web-curated datasets, as well as standard image classification benchmarks, and achieve fine-grained prompt-free prediction scores of up to 87.5%, a strong result considering the model must work for any conceivable image and without any contextual clues.
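
The idea of training on templated captions whose target is always the object noun can be sketched as follows. This is only an illustration under stated assumptions: the templates and the three-noun list are hypothetical placeholders for the paper's prompt templates and English object noun dictionary, and `make_pairs` is an invented helper name.

```python
import random

# Hypothetical templates; the actual prompt templates are not given in this text.
TEMPLATES = [
    "a photo of the {noun}",
    "a close-up picture of the {noun}",
    "the {noun}",
]

# Tiny stand-in for a comprehensive object noun dictionary.
NOUNS = ["golden retriever", "teapot", "suspension bridge"]


def make_pairs(nouns, templates, per_noun, rng):
    """Build (caption, target noun) text pairs from randomly chosen templates."""
    pairs = []
    for noun in nouns:
        # Sample distinct templates so each noun gets varied captions.
        for template in rng.sample(templates, per_noun):
            pairs.append((template.format(noun=noun), noun))
    return pairs


if __name__ == "__main__":
    for caption, target in make_pairs(NOUNS, TEMPLATES, 2, random.Random(0)):
        print(f"{caption!r} -> {target!r}")
```

Each caption would then be encoded by the CLIP text encoder, and the decoder would be trained to map the resulting embedding back to the target noun, which is what makes the decoder act as an inverse of the text encoder.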

Paper Structure

This paper contains 44 sections, 13 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Proposed open vocabulary image classifier. The classifier processes each input image by encoding it into a CLIP embedding vector, which is then decoded into a sequence of tokens representing the object class. The object decoder is purely generative, producing free-form text without relying on a predefined list of candidate objects. This method is trained solely on textual data and is capable of zero-shot transfer to image data during inference.
  • Figure 2: Overview of the NOVIC training and inference schemes. A dataset of caption-object text pairs is generated offline using an English dictionary, prompt templates, and LLM-based caption generation. The captions are encoded offline into text embeddings, and augmented with noise to train the object decoder. The training scheme seamlessly generalizes across the large modality gap between image and text embeddings, allowing inference of arbitrary images via the image encoder. Produced classifications can be very fine-grained.
  • Figure 3: Architecture of the object decoder. The decoder-only transformer is given a sequence of linearly projected embedding vector tokens (full attention) followed by autoregressive object noun tokens (causal attention). The output sequence is converted to token logits using a linear layer that is weight-tied to the token embeddings. Cross-entropy loss with left-shifted target tokens is used during training.
  • Figure 4: Diversity comparison of predicted object nouns. Plots showing the sorted top frequency counts of the object nouns predicted for Wiki-H. Even when trained on FT9, NOVIC shows much greater diversity and has a less peaked frequency distribution. Right: The top-10 nouns for RAM are generic and overused.
  • Figure 5: Comparison of angle separation distributions. A plot of the distribution of embedding vector angle separations in 768-dimensional embedding space for matching and non-matching image-label pairs from the ImageNet-1K validation set. Also shown is the distribution of angle separations between original and noise-augmented text embeddings for both the pure Gaussian and Gaussian with uniform strategies. The dashed vertical line shows exactly perpendicular vectors that are thus mathematically uncorrelated.
  • ...and 3 more figures
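
The attention pattern and target shifting described in the Figure 3 caption can be sketched in plain Python. This is one reading of the caption, not the authors' code: it assumes that the prefix (embedding) tokens attend to each other in both directions, that noun tokens attend to the whole prefix plus earlier noun tokens, and that sequences are wrapped in hypothetical `<bos>` and `<eos>` markers. The function names are invented for illustration.

```python
def build_attention_mask(num_prefix, num_tokens):
    """Return mask[i][j], True where position i may attend to position j.

    The first num_prefix positions are embedding tokens (full attention);
    the remaining num_tokens positions are noun tokens (causal attention).
    """
    total = num_prefix + num_tokens
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_prefix:
                mask[i][j] = True  # every position sees all prefix tokens
            elif i >= num_prefix and j <= i:
                mask[i][j] = True  # noun tokens see themselves and earlier nouns
    return mask


def shift_targets(tokens):
    """Split a token sequence into decoder inputs and left-shifted targets."""
    return tokens[:-1], tokens[1:]


if __name__ == "__main__":
    for row in build_attention_mask(num_prefix=2, num_tokens=3):
        print("".join("x" if allowed else "." for allowed in row))
    inputs, targets = shift_targets(["<bos>", "tea", "pot", "<eos>"])
    print(inputs, targets)
```

With left-shifted targets, the output at each noun position is trained to predict the next token, so cross-entropy can be computed over all positions in a single forward pass.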