TMCIR: Token Merge Benefits Composed Image Retrieval

Chaoyang Wang, Zeyu Zhang, Long Teng, Zijun Li, Shichao Kan

TL;DR

This paper tackles biased cross-modal fusion in composed image retrieval by introducing TMCIR, which combines diffusion-based pseudo-target generation for intent-aware cross-modal alignment with an adaptive token fusion strategy. The method first generates pseudo-target images conditioned on a reference image and relative text, then fine-tunes CLIP encoders on image-text pairs to achieve aligned cross-modal tokens, and finally fuses visual and textual tokens using similarity-weighted, position-aware merging. Key contributions include the diffusion-conditioned pseudo-target approach (IACMA) and the similarity-based token merge (ATF), which together preserve reference details while faithfully encoding modification intent. Empirical results on Fashion-IQ and CIRR show state-of-the-art performance and improved robustness in capturing nuanced user intents for CIR tasks.

Abstract

Composed Image Retrieval (CIR) retrieves target images using a multi-modal query that combines a reference image with text describing desired modifications. The primary challenge is effectively fusing this visual and textual information. Current cross-modal feature fusion approaches for CIR exhibit an inherent bias in interpreting intent: they tend to disproportionately emphasize either the reference image features (visual-dominant fusion) or the textual modification intent (text-dominant fusion through image-to-text conversion). Such an imbalanced representation often fails to capture the user's actual search intent in the retrieval results. To address this challenge, we propose TMCIR, a novel framework that advances composed image retrieval through two key innovations: 1) Intent-Aware Cross-Modal Alignment. We first fine-tune CLIP encoders contrastively using intent-reflecting pseudo-target images, synthesized from reference images and textual descriptions via a diffusion model. This step enhances the text encoder's ability to capture nuanced intents in textual descriptions. 2) Adaptive Token Fusion. We further fine-tune all encoders contrastively by comparing adaptive token-fusion features with the target image. This mechanism dynamically balances visual and textual representations within the contrastive learning pipeline, optimizing the composed feature for retrieval. Extensive experiments on the Fashion-IQ and CIRR datasets demonstrate that TMCIR significantly outperforms state-of-the-art methods, particularly in capturing nuanced user intent.
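Both stages in the abstract rely on contrastive fine-tuning, so a small illustration of a batch contrastive objective may help. The sketch below is a generic InfoNCE-style loss in plain Python, not the paper's exact formulation, which this summary does not state. The function names, the one-directional (query-to-target) form, and the temperature value of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, targets, temperature=0.07):
    """Generic InfoNCE loss (illustrative, not the paper's exact loss).

    queries[i] is a composed query feature and targets[i] is the feature
    of its matching target image. Every other target in the batch acts
    as a negative. The temperature is an assumed value.
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every target, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching target as the correct class.
        total += log_denom - logits[i]
    return total / len(q)
```

The loss is close to zero when each query is most similar to its own target and grows when a query sits closer to another image in the batch, which is the pressure that pulls composed features toward their target images.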

Paper Structure

This paper contains 14 sections, 20 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Workflows of existing CIR methods and our proposed TMCIR
  • Figure 2: Retrieval examples using the proposed TMCIR, CLIP4CIR (Baldrati et al., 2022; visual-dominant feature fusion), and Pic2word (Saito et al., 2023; text-dominant fusion) methods, respectively.
  • Figure 3: An overview of the TMCIR framework. It consists of two modules: the "Intent-Aware Cross-Modal Alignment" module and the "Adaptive Token Fusion" module. First, we input the reference image and the relative description into a diffusion model to generate a pseudo-target image. Through contrastive learning, we guide the visual and textual encoders to achieve cross-modal token distribution alignment. Then, the reference image and the relative description are fused using an adaptive token fusion strategy based on positional encoding and semantic similarity, generating a joint representation that captures both the user intent and the key visual information from the reference image.
  • Figure 4: Ablation studies in terms of average recall with regard to different values of the similarity threshold.
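
The Figure 3 and Figure 4 captions mention token fusion driven by semantic similarity and a similarity threshold. The sketch below shows one plausible way a threshold-gated token merge could work. It is an assumption-laden illustration, not the paper's Adaptive Token Fusion: the function names, the merge rule (v + s * t) / (1 + s), the default threshold of 0.5, and the choice to append unmatched text tokens are all hypothetical, and positional encoding is left out entirely.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def merge_tokens(visual, textual, threshold=0.5):
    """Threshold-gated token merge (hypothetical rule, for illustration).

    Each visual token is compared with every text token. If its best
    match reaches the threshold, the two are combined with a
    similarity-weighted average; otherwise the visual token is kept
    unchanged. Text tokens that were never merged are appended so
    their content is not lost.
    """
    if not textual:
        return [list(v) for v in visual]
    fused = []
    matched = set()
    for v in visual:
        sims = [cosine(v, t) for t in textual]
        j = max(range(len(textual)), key=sims.__getitem__)
        s = sims[j]
        if s >= threshold:
            t = textual[j]
            # Weight the text token by its similarity to the visual token.
            fused.append([(a + s * b) / (1.0 + s) for a, b in zip(v, t)])
            matched.add(j)
        else:
            fused.append(list(v))
    # Keep text tokens that had no sufficiently similar visual partner.
    fused.extend(list(t) for j, t in enumerate(textual) if j not in matched)
    return fused
```

Under this rule, a higher threshold merges fewer pairs and leaves more tokens in the fused sequence, while a lower threshold merges more aggressively. That trade-off is the kind of behavior a threshold ablation such as the one in Figure 4 would probe.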