TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias
Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, Kyungsu Kim
TL;DR
The paper addresses single tag bias in CLIP-based image-text alignment by introducing Text-Tag Self-Distillation (TTD), a two-stage fine-tuning framework that operates solely on image-text pairs. It first extracts image-relevant tags from text via a pixel-centric tag scoring method, then performs self-distillation to align the image-text similarity map with the union of pseudo-tag maps, supplemented by a tag-focused loss. This approach yields model-agnostic improvements in multi-tag classification and open-vocabulary segmentation, with substantial gains across several benchmarks and without external supervision or NLP resources. The work also shows potential for automatic tag and mask labeling, reports ablations showing the importance of pixel-based tag selection and the distillation losses, and highlights practical gains for downstream localization tasks.
Abstract
We identify a critical bias in contemporary CLIP-based models, which we denote as single tag bias. This bias manifests as a disproportionate focus on a single tag (word) while neglecting other pertinent tags, stemming from CLIP's text embeddings that prioritize one specific tag in image-text relationships. When deconstructing text into individual tags, only one tag tends to have high relevancy with CLIP's image embedding, leading to biased tag relevancy. In this paper, we introduce a novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), to address this challenge. TTD first extracts image-relevant tags from text based on their similarity to the nearest pixels, then employs a self-distillation strategy to align combined masks with the text-derived mask. This approach ensures unbiased image-text alignment of CLIP-based models using only image-text pairs, without necessitating additional supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. The code is available at https://github.com/shjo-april/TTD.
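The two steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); it only shows the general idea under several assumptions. The function names, the fixed score threshold, the element-wise maximum used to form the union of tag maps, and the mean-squared-error loss are all hypothetical choices made for illustration. Inputs are assumed to be precomputed per-pixel image embeddings and per-tag text embeddings in a shared space.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def score_tags_by_nearest_pixel(pixel_emb, tag_emb):
    """Score each tag by its cosine similarity to its most similar pixel.

    pixel_emb: (P, D) array of per-pixel image embeddings (assumed given).
    tag_emb:   (T, D) array of per-tag text embeddings (assumed given).
    Returns (scores, sim): scores has shape (T,), sim has shape (T, P).
    """
    sim = l2_normalize(tag_emb) @ l2_normalize(pixel_emb).T  # (T, P)
    scores = sim.max(axis=1)  # similarity to the nearest pixel per tag
    return scores, sim


def select_tags(scores, threshold):
    """Keep tags whose score reaches the threshold (assumed selection rule)."""
    return np.flatnonzero(scores >= threshold)


def union_of_tag_maps(sim, selected):
    """Combine the selected tags' similarity maps with an element-wise max.

    The max is an assumed stand-in for the 'union' of pseudo-tag maps.
    """
    return sim[selected].max(axis=0)  # (P,)


def distillation_loss(text_map, target_map):
    """Mean-squared error between two maps (assumed stand-in loss)."""
    return float(np.mean((text_map - target_map) ** 2))
```

In a training loop, the map produced by the full text would play the role of `text_map` and the union of selected tag maps would serve as the target, so that minimizing the loss pulls the text-derived map toward coverage of every relevant tag rather than a single dominant one.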
