Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP

Eunji Kim, Kyuhong Shim, Simyung Chang, Sungroh Yoon

TL;DR

This paper proposes SToRI, a framework of Semantic Token Reweighting to build Interpretable text embeddings, which also incorporates controllability, enabling finer control over emphasis in response to data-driven insights and user preferences.

Abstract

A text encoder within Vision-Language Models (VLMs) like CLIP plays a crucial role in translating textual input into an embedding space shared with images, thereby facilitating the interpretative analysis of vision tasks through natural language. Despite the varying significance of different textual elements within a sentence depending on the context, efforts to account for variation of importance in constructing text embeddings have been lacking. We propose a framework of Semantic Token Reweighting to build Interpretable text embeddings (SToRI), which incorporates controllability as well. SToRI refines the text encoding process in CLIP by differentially weighting semantic elements based on contextual importance, enabling finer control over emphasis responsive to data-driven insights and user preferences. The efficacy of SToRI is demonstrated through comprehensive experiments on few-shot image classification and image retrieval tailored to user preferences.
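The abstract does not spell out how the weights enter the text encoder. As a rough illustration only, and not the authors' implementation, one way to picture "differentially weighting semantic elements" is to scale each token's attention contribution by a positive per-token weight and then renormalize. The function name `reweighted_attention`, the single-head setup, and the random toy inputs below are all assumptions made for this sketch.

```python
import numpy as np

def reweighted_attention(q, k, v, w):
    """Single-head attention in which key token j is scaled by w[j] > 0.

    Hypothetical sketch: adding log(w[j]) to the logits multiplies the
    unnormalized attention score of token j by w[j]. With w all ones this
    reduces to ordinary softmax attention.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    logits = logits + np.log(w)[None, :]
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ v, probs

# Toy input: 5 tokens with 8-dimensional features (random, for illustration).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))

uniform = np.ones(5)                              # no emphasis
emphasis = np.array([1.0, 1.0, 3.0, 1.0, 1.0])    # emphasize token 2

_, p_plain = reweighted_attention(tokens, tokens, tokens, uniform)
_, p_emph = reweighted_attention(tokens, tokens, tokens, emphasis)
```

In this sketch, raising the weight of one token increases the attention mass every query places on it, while each attention row still sums to 1, so the weights stay directly readable as relative emphasis.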

Paper Structure

This paper contains 45 sections, 4 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: System diagram of SToRI. SToRI facilitates data-driven control through interpretable weight optimization in the semantic space, enhancing the classification performance of image data. It also enables user-driven control over multiple images by allowing fine-grained manipulation of the text prompts. Weights affect text embeddings via semantic token reweighting (STR).
  • Figure 2: Text prompts and corresponding weights are provided as examples after training. The intensity of the red shading reflects the weight assigned, with darker shades indicating higher weights. For visualization, the weights are normalized to sum to 1. The figures on the right display an example image for each class.
  • Figure 3: Text prompts and their corresponding weights are presented after training with the CUB dataset. The more intense the shade of red, the greater the weight assigned. In each scenario, the text classifier is trained to discriminate two classes. The weights for the same text prompts vary depending on the class to be distinguished.
  • Figure 4: Results of preference retrieval using the text prompt 'a photo of a woman with blonde hair, wearing eyeglasses'. The first row shows density plots with the retrieval order, and the second row visualizes the ratio of retrieved samples within each category. The left column shows results from a plain text prompt, whereas the right column depicts the results when the weights are adjusted. Best viewed in color.
  • Figure 5: AUC scores from preference retrieval with varying weights. The text prompt is 'a photo of a woman with blonde hair, wearing eyeglasses'. The weights on 'with blonde hair' and 'wearing eyeglasses' are $w$ and $(2-w)$, respectively, so they are adjusted simultaneously in opposite directions. Best viewed in color.
  • ...and 5 more figures
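
The weight conventions mentioned in the captions for Figures 2 and 5 can be sketched in a few lines: the two phrase weights are tied as $w$ and $(2-w)$, and weights are normalized to sum to 1 for display. The helper names `coupled_weights` and `normalize`, and the restriction of $w$ to the range 0 to 2, are assumptions for this illustration and are not taken from the paper.

```python
def coupled_weights(w):
    """Return the tied phrase weights w and (2 - w) from the Figure 5 caption.

    The [0, 2] range check is an assumption that keeps both weights
    non-negative; the caption itself does not state a range.
    """
    if not 0.0 <= w <= 2.0:
        raise ValueError("w is assumed to lie in [0, 2]")
    return {"with blonde hair": w, "wearing eyeglasses": 2.0 - w}

def normalize(weights):
    """Rescale weights to sum to 1, as done for visualization in Figure 2."""
    total = sum(weights.values())
    return {phrase: value / total for phrase, value in weights.items()}

# Sweep the shared control value and show the normalized emphasis per phrase.
for w in (0.5, 1.0, 1.5):
    print(w, normalize(coupled_weights(w)))
```

Because the two weights always sum to 2, moving $w$ shifts emphasis from one phrase to the other without changing the total emphasis placed on the pair.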