CalibCLIP: Contextual Calibration of Dominant Semantics for Text-Driven Image Retrieval

Bin Kang, Bin Chen, Junjie Wang, Yulin Li, Junzhi Zhao, Zhuotao Tian

TL;DR

CalibCLIP addresses the problem that dominant, low-information tokens in Vision-Language Models hinder fine-grained text-driven image retrieval. It presents a training-free, dual-space approach: the Contrastive Visual Enhancer (CVE) decouples visual features and suppresses dominant tokens, and the Discriminative Concept Calibrator (DCC) separates general from discriminative textual concepts and includes a discriminative similarity mechanism. The method yields consistent improvements across seven benchmarks spanning TBPR, TIR, and CIR, including state-of-the-art results on several datasets, without modifying model architectures. Because it is training-free, it reduces reliance on extensive retraining or supervision for cross-modal retrieval tasks.

Abstract

Existing Visual Language Models (VLMs) suffer from a structural limitation: a few low-contribution tokens may excessively capture global semantics, dominating the information aggregation process and suppressing the discriminative features in text-driven image retrieval tasks. To address this, we introduce CalibCLIP, a training-free method designed to calibrate the suppressive effect of dominant tokens. Specifically, in the visual space, we propose the Contrastive Visual Enhancer (CVE), which decouples visual features into target and low-information regions. It then identifies dominant tokens and dynamically suppresses their representations. In the textual space, we introduce the Discriminative Concept Calibrator (DCC), which differentiates between general and discriminative concepts within the text query. By mitigating the challenges posed by generic concepts and improving the representations of discriminative concepts, DCC strengthens the differentiation among similar samples. Finally, extensive experiments demonstrate consistent improvements across seven benchmarks spanning three image retrieval tasks, underscoring the effectiveness of CalibCLIP. Code is available at: https://github.com/kangbin98/CalibCLIP
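The dominant-token effect described in the abstract can be illustrated with a toy example. The sketch below is not the authors' CVE or DCC implementation; it only shows, under assumed values, how a few tokens with large attention logits absorb most of the softmax mass and how masking them redistributes attention to the remaining tokens. The function name `suppress_dominant`, the `mean + k * std` threshold rule, and the example logits are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array (handles -inf entries)."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def suppress_dominant(logits, k=1.0):
    """Mask tokens whose attention exceeds mean + k * std, then renormalize.

    The threshold rule is an assumption for this sketch, not the rule
    used by CalibCLIP. With k >= 0, at least one token always survives,
    because not every attention weight can exceed the mean.
    """
    attn = softmax(logits)
    dominant = attn > attn.mean() + k * attn.std()
    # Setting a logit to -inf gives that token exactly zero weight.
    calibrated = softmax(np.where(dominant, -np.inf, logits))
    return attn, calibrated, dominant


# Hypothetical [CLS]-to-token attention logits: tokens 0 and 4 dominate.
logits = np.array([4.0, 0.5, 0.2, 0.1, 3.8, 0.3])
attn, calibrated, dominant = suppress_dominant(logits)

print("dominant tokens:", np.flatnonzero(dominant))
print("attention before:", np.round(attn, 3))
print("attention after: ", np.round(calibrated, 3))
```

In this toy setting the two dominant tokens hold roughly 95% of the attention before masking; afterwards the remaining four tokens share all of it. A hard mask is the simplest possible form of suppression, and a softer down-weighting would follow the same pattern.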

Paper Structure

This paper contains 15 sections, 9 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 2: CLS/EOT token self-attention. A few low-information tokens receive disproportionately high attention, and this persists even with task-specific fine-tuning.
  • Figure 3: Comparison of [EOT] token attention: when dominant words like "teddy" and "bear" are masked, attention on the remaining tokens increases significantly.
  • Figure 4: Attention maps across encoding layers show the baseline model's tendency to over-focus on low-information tokens, whereas our method prioritizes task-relevant regions.
  • Figure 5: Illustration of the CalibCLIP framework. We calibrate contextually dominant tokens through a dual-space intervention: in the visual space, the CVE module isolates objects from low-information regions while suppressing dominant tokens; in the text space, the DCC module disentangles text into general and discriminative attributes for fine-grained differentiation.
  • Figure 6: Ablation study of each component of our method on representative datasets for three language-driven retrieval tasks.
  • ...and 1 more figure