Discriminative-Generative Custom Tokens for Vision-Language Models

Pramuditha Perera, Matthew Trager, Luca Zancato, Alessandro Achille, Stefano Soatto

TL;DR

This paper tackles the problem of learning custom tokens that enable a Vision-Language Model to perform both generation and recognition while composing effectively with natural language. It introduces a unified framework combining textual inversion and a discriminative classification objective, augmented by a subspace projection to improve compositionality. A key innovation is Generation Aided Image Retrieval (GAIR), which iteratively refines queries at inference time by balancing the learned token with attribute-based prompts to improve text-to-image retrieval, demonstrated on DeepFashion2 with notable MRR gains. The approach yields tokens that generate faithful visuals of the target concept, enable robust classification, and support interactive query refinement, offering practical gains for multimodal search and visualization of retrieved results.
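The retrieval gains above are reported in MRR, which is conventionally the mean over queries of the reciprocal rank of the first relevant result. The following is a minimal sketch of that standard metric, not the paper's evaluation code; the function name `mean_reciprocal_rank` and the list-of-rankings input format are assumptions made for illustration.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """Mean reciprocal rank over a set of queries.

    ranked_results: list of ranked item-id lists, one per query (hypothetical format).
    relevant: list of sets of relevant item ids, aligned with ranked_results.
    A query with no relevant item in its ranking contributes 0.
    """
    if not ranked_results:
        return 0.0
    total = 0.0
    for ranking, relevant_ids in zip(ranked_results, relevant):
        for position, item in enumerate(ranking, start=1):
            if item in relevant_ids:
                # Only the first relevant hit counts toward the score.
                total += 1.0 / position
                break
    return total / len(ranked_results)


# Toy example: first relevant hits at rank 2, rank 1, and not found.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
relevant_sets = [{"b"}, {"x"}, {"missing"}]
score = mean_reciprocal_rank(rankings, relevant_sets)
```

In the toy example the per-query reciprocal ranks are 1/2, 1, and 0, so the score is their average.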

Abstract

This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs). Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries. The targeted concept is specified in terms of a small set of images and a parent concept described using text. We operate on CLIP text features and propose to use a combination of a textual inversion loss and a classification loss to ensure that text features of the learned token are aligned with image features of the concept in the CLIP embedding space. We restrict the learned token to a low-dimensional subspace spanned by tokens for attributes that are appropriate for the given super-class. These modifications improve the quality of compositions of the learned token with natural language for generating new scenes. Further, we show that learned custom tokens can be used to form queries for the text-to-image retrieval task, and also have the important benefit that composite queries can be visualized to ensure that the desired concept is faithfully encoded. Based on this, we introduce the method of Generation Aided Image Retrieval, where the query is modified at inference time to better suit the search intent. On the DeepFashion2 dataset, our method improves Mean Reciprocal Rank (MRR) over relevant baselines by 7%.
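One simple way to restrict a token to the subspace spanned by attribute tokens is an orthogonal projection onto that span. The sketch below illustrates the idea with NumPy; it is an assumption about how such a restriction could be realized, not the authors' implementation, and the names `project_onto_attribute_subspace`, `token`, and `attribute_embeddings` are hypothetical.

```python
import numpy as np


def project_onto_attribute_subspace(token, attribute_embeddings):
    """Orthogonally project `token` onto the span of the attribute embeddings.

    token: array of shape (d,), the candidate custom-token embedding.
    attribute_embeddings: array of shape (k, d), one row per attribute token.
    Returns the closest point (in Euclidean distance) to `token` in the span.
    """
    basis = np.asarray(attribute_embeddings, dtype=float).T  # shape (d, k)
    target = np.asarray(token, dtype=float)
    # Least squares finds coefficients c minimizing ||basis @ c - target||,
    # and also handles linearly dependent attribute vectors.
    coeffs, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return basis @ coeffs


# Toy example in 3 dimensions with two attribute directions.
attributes = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
token = np.array([1.0, 2.0, 3.0])
projected = project_onto_attribute_subspace(token, attributes)
residual = token - projected  # component outside the attribute subspace
```

In the toy example the third coordinate lies outside the span of the two attribute directions, so the projection discards it and the residual is orthogonal to every attribute vector. An alternative with the same effect would be to learn the coefficients directly and never leave the subspace during optimization.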

Paper Structure

This paper contains 10 sections, 6 figures, 4 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Learning custom tokens to represent a given set of images. Prompt tuning and textual inversion learn representations that are optimized for image recognition and generation, respectively. In this paper we examine the possibility of learning custom tokens that can be used for both generation and recognition and also compose with natural language.
  • Figure 2: Overview of the proposed method for learning a custom token for the teapot class. We learn custom tokens that can be used both to generate images of the targeted concept and to discriminate it from other concepts.
  • Figure 3: Textual Inversion conditioned on CLIP does not compose well with natural language. Our custom tokens compose better with other text after subspace projection is performed.
  • Figure 4: Cosine similarity between normalized text embeddings and the learned token embedding (last column) for (i) textual inversion (TI), (ii) textual inversion + cross-entropy loss (TI+CE), and (iii) subspace projection + cross-entropy loss (P+TI+CE); (iv) norm of the learned word embedding under the different methods. Image intensity is clipped at 0.4 for visualization.
  • Figure 5: Images generated with different captions for classes from the Textual Inversion dataset.
  • ...and 1 more figure