TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, Kyungsu Kim

TL;DR

The paper addresses single tag bias in CLIP-based image-text alignment by introducing Text-Tag Self-Distillation (TTD), a two-stage fine-tuning framework that operates solely on image-text pairs. It first extracts image-relevant tags from text via a pixel-centric tag scoring method and then performs self-distillation to align the image-text similarity map with the union of pseudo-tag maps, supplemented by a tag-focused loss. This approach yields model-agnostic improvements in multi-tag classification and open-vocabulary segmentation across several benchmarks, without external supervision or NLP resources. The work also shows potential for automatic tag and mask labeling, reports ablations supporting the importance of pixel-based tag selection and the distillation losses, and highlights practical gains for downstream localization tasks.

Abstract

We identify a critical bias in contemporary CLIP-based models, which we denote as single tag bias. This bias manifests as a disproportionate focus on a single tag (word) while neglecting other pertinent tags, stemming from CLIP's text embeddings that prioritize one specific tag in image-text relationships. When deconstructing text into individual tags, only one tag tends to have high relevancy with CLIP's image embedding, leading to biased tag relevancy. In this paper, we introduce a novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), to address this challenge. TTD first extracts image-relevant tags from text based on their similarity to the nearest pixels, then employs a self-distillation strategy to align combined masks with the text-derived mask. This approach ensures the unbiased image-text alignment of the CLIP-based models using only image-text pairs without necessitating additional supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. The code is available at https://github.com/shjo-april/TTD.
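
The first step described in the abstract, scoring each tag by its similarity to its nearest pixel, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-dimensional toy embeddings, and the use of plain Python lists in place of CLIP features are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def pixel_tag_scores(pixel_embeddings, tag_embeddings):
    """Score each tag by the cosine similarity to its most similar pixel.

    pixel_embeddings: list of per-pixel vectors (hypothetical stand-in
    for the dense output of a CLIP image encoder).
    tag_embeddings: dict mapping each tag to its text embedding.
    """
    return {
        tag: max(cosine(emb, pixel) for pixel in pixel_embeddings)
        for tag, emb in tag_embeddings.items()
    }


# Toy data (invented): three "pixels" and three tags in a 2-D space.
pixels = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
tags = {
    "jacuzzi": [1.0, 0.0],   # matches the first pixel
    "branch": [0.0, 2.0],    # matches the second pixel (scale is ignored)
    "of": [-1.0, -1.0],      # matches no pixel
}

scores = pixel_tag_scores(pixels, tags)
for tag, score in scores.items():
    print(f"{tag}: {score:.3f}")
```

Because each tag is compared only with its best-matching pixel, a tag that covers a small region can still receive a high score, whereas a score computed against a single global image embedding would be dominated by whichever tag that embedding favors.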

Paper Structure (33 sections, 7 equations, 17 figures, 15 tables)

This paper contains 33 sections, 7 equations, 17 figures, 15 tables.

Figures (17)

  • Figure 1: Single Tag Bias. In existing image-text alignment research, single tag bias is evident where the image and text embeddings tend to concentrate solely on a single tag. (a) Single Tag Bias in Image-Text Relationships: When examining the similarity map between the image (i.e., pixels) and text, it is evident that only a region of a single tag inside the red box is activated, disregarding other tags mentioned in the text. (b) Single Tag Bias in Image-Tag Relationships: Even when observing the similarity values between each tag in the text and the image, a high value is assigned only to the single tag, while other tags that describe the image (e.g., "green" and "branch") exhibit similar values to insignificant tags (e.g., "of" and "a"). We use TCL [cha2023learning] for the analysis.
  • Figure 2: High-level Overview of Our Method. (a) Tag Selection by Pixel-Tag Scoring: The global image embedding predominantly reflects information about a single tag, "jacuzzi" in this case, due to the overactivation of a single tag in pixel embeddings. The colored similarity map shows the most relevant tag for each pixel (e.g., orange pixels have the highest similarity with "jacuzzi"). We minimize the impact of single tag bias observed in the image embedding by employing cosine similarity between the tag and its most correlated pixel as a score. (b) Text-Tag Self-Distillation: The similarity map between image and text highlights only a single tag, whereas it should represent all relevant tags in the image-text relationship. To alleviate the bias, we train the image-text map to align with the union of maps between the image and the pseudo-tags obtained in (a), thus enhancing the image-text alignment.
  • Figure 3: A Conceptual Comparison between Previous Approaches and Ours. (a) Image-text alignment models, while useful for various open-vocabulary tasks, suffer from single tag bias. To alleviate this bias, extracting tags that reflect the image-text relationship from the text for training is crucial. (b) However, existing research often relies on external NLP models to extract tags without considering images, leading to issues: 1) Extracting image-irrelevant tags. 2) Overlooking image-relevant tags. (c) In contrast, we propose a tag selection method using only pixel information from images, eliminating the need for reliance on external models.
  • Figure 4: Overall Framework of Our Method. (a) Tag Selection by Pixel-Tag Scoring: Using the similarity with the most correlated pixel as the score, we identify a significant gap between tags related to the image and irrelevant ones. We use the largest score gap as the threshold for selecting tags. (b) Method Overview: Our fine-tuning approach begins by obtaining pseudo-tags through our tag selection process, which are then used to create an ideal heatmap representing the image-text relationship. We address single tag bias by distilling information between the ideal and actual heatmaps. Additionally, we integrate an auxiliary loss to learn the pixel-tag relationship. Throughout this process, both the image and text encoders are trained.
  • Figure 5: Qualitative Results of Tag Selection. Tag selection using external models [loper2002nltk, chiang2023vicuna, bai2023qwen] suffers from two problems because these models do not consider image information. 1) Extracting Image-irrelevant Tags (red): They select tags that do not correlate with the corresponding image content. 2) Overlooking Image-relevant Tags (blue): They fail to select tags related to the image, particularly non-nouns in the case of NLTK.
  • ...and 12 more figures
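
The captions for Figures 2 and 4 describe selecting tags at the largest score gap and then training the image-text map toward the union of the pseudo-tag maps. The sketch below illustrates that flow on toy numbers. It is a simplified reading of the captions, not the paper's implementation: taking the union as a pixelwise maximum and measuring the mismatch with a mean absolute (L1) difference are assumptions, and all scores, maps, and function names are invented for illustration.

```python
def select_tags_by_largest_gap(scores):
    """Keep the tags ranked above the largest drop between consecutive scores."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return [tag for tag, _ in ranked]
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)  # index of the largest gap
    return [tag for tag, _ in ranked[: cut + 1]]


def union_map(tag_maps):
    """Combine per-tag similarity maps by pixelwise maximum (assumed union)."""
    return [max(values) for values in zip(*tag_maps)]


def l1_distillation_loss(text_map, target_map):
    """Mean absolute difference between two maps (assumed loss form)."""
    return sum(abs(a - b) for a, b in zip(text_map, target_map)) / len(text_map)


# Toy pixel-tag scores (invented): three relevant tags, two filler words.
scores = {"bird": 0.82, "green": 0.78, "branch": 0.75, "of": 0.31, "a": 0.29}
selected = select_tags_by_largest_gap(scores)
print(selected)

# Toy similarity maps over four pixels, one map per selected tag (invented).
tag_maps = {
    "bird": [0.9, 0.1, 0.0, 0.0],
    "green": [0.0, 0.8, 0.1, 0.0],
    "branch": [0.0, 0.1, 0.7, 0.0],
}
target = union_map([tag_maps[tag] for tag in selected])

# A biased image-text map that activates only the "bird" region.
text_map = [0.9, 0.1, 0.0, 0.0]
loss = l1_distillation_loss(text_map, target)
print(target, round(loss, 3))
```

In this toy case the image-text map matches the target only on the pixel covered by "bird", so the loss is driven entirely by the regions of the neglected tags, which is the behavior the distillation step is meant to correct.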