Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval

Haoqiang Lin, Haokun Wen, Xuemeng Song, Meng Liu, Yupeng Hu, Liqiang Nie

TL;DR

This paper tackles zero-shot composed image retrieval (ZS-CIR) with FTI4CIR, which maps each image into a fine-grained set of pseudo-words: one subject-oriented token and multiple attribute-oriented tokens. A dynamic local attribute feature extraction module and an orthogonality constraint are used to learn diverse, domain-sensitive local attributes, and a tri-wise caption-based semantic regularization aligns the pseudo-words with real-word embeddings using BLIP-generated captions. At inference, CIR reduces to a pure text-to-image retrieval task: the pseudo-words from the reference image are composed with the modification text to form the query. Experiments on FashionIQ, CIRR, and CIRCO show consistent improvements over zero-shot baselines and strong generalization, with ablations confirming the importance of each component and the benefit of the tri-wise regularization. The work advances zero-shot CIR by capturing fine-grained image content and aligning it with the textual space, enabling more accurate and flexible user queries across domains.
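
To make the role of an orthogonality constraint more concrete, here is a minimal sketch in plain Python. It uses one common formulation, the squared Frobenius norm of the Gram matrix of L2-normalized features minus the identity; the exact loss used by FTI4CIR is not given in this summary, so this form, the function names, and the toy vectors are all assumptions for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def orthogonality_penalty(features):
    """Squared Frobenius norm of (F F^T - I) for row-normalized features F.

    Assumed formulation: zero when all attribute features are mutually
    orthogonal, larger when features point in similar directions.
    """
    rows = [l2_normalize(f) for f in features]
    penalty = 0.0
    for i, a in enumerate(rows):
        for j, b in enumerate(rows):
            dot = sum(x * y for x, y in zip(a, b))
            target = 1.0 if i == j else 0.0
            penalty += (dot - target) ** 2
    return penalty

# Toy local attribute features (hypothetical values).
orthogonal = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
redundant = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]

print(orthogonality_penalty(orthogonal))  # 0.0
print(orthogonality_penalty(redundant))   # 2.0
```

Minimizing a penalty of this kind pushes the latent attribute features apart, which is one way to encourage each attribute-oriented token to describe a different aspect of the image.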

Abstract

Composed Image Retrieval (CIR) allows users to search for target images with a multimodal query, comprising a reference image and a modification text that describes the user's modification demand over the reference image. Nevertheless, due to the expensive labor cost of training data annotation, researchers have recently shifted to the challenging task of zero-shot CIR (ZS-CIR), which aims to perform CIR without annotated triplets. Pioneering ZS-CIR studies focus on converting the CIR task into a standard text-to-image retrieval task by pre-training a textual inversion network that can map a given image into a single pseudo-word token. Despite their significant progress, their coarse-grained textual inversion may be insufficient to capture the full content of the image accurately. To overcome this issue, in this work, we propose a novel Fine-grained Textual Inversion Network for ZS-CIR, named FTI4CIR. In particular, FTI4CIR comprises two main components: fine-grained pseudo-word token mapping and tri-wise caption-based semantic regularization. The former maps the image into a subject-oriented pseudo-word token and several attribute-oriented pseudo-word tokens to comprehensively express the image in textual form, while the latter jointly aligns the fine-grained pseudo-word tokens to the real-word token embedding space based on a BLIP-generated image caption template. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our proposed method.
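
The reduction of CIR to text-to-image retrieval can be sketched as follows. The prompt template, the placeholder names `[S*]` and `[A1*]`, and the toy embeddings are hypothetical and not taken from the paper; in a real system the placeholders would be replaced by learned pseudo-word token embeddings and the query would be encoded by a text encoder such as CLIP's.

```python
import math

def compose_query(num_attribute_tokens, modification_text):
    """Build a text query with placeholder pseudo-words (hypothetical template)."""
    attrs = " ".join(f"[A{i}*]" for i in range(1, num_attribute_tokens + 1))
    return f"a photo of [S*] with {attrs}, but {modification_text}"

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_gallery(query_embedding, gallery):
    """Return gallery image ids sorted by similarity to the query, best first."""
    return sorted(
        gallery,
        key=lambda name: cosine(query_embedding, gallery[name]),
        reverse=True,
    )

query = compose_query(2, "is red instead of blue")
print(query)  # a photo of [S*] with [A1*] [A2*], but is red instead of blue

# Stand-in embeddings; a real pipeline would obtain these from the encoders.
gallery = {"img_a": [1.0, 0.0], "img_b": [0.6, 0.8], "img_c": [0.0, 1.0]}
query_embedding = [0.9, 0.1]
print(rank_gallery(query_embedding, gallery))  # ['img_a', 'img_b', 'img_c']
```

Because the composed query lives entirely in the text space, ranking needs only a text-to-image similarity function, which is why no annotated (reference image, modification text, target image) triplets are required.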

Paper Structure

This paper contains 18 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: An illustration of method comparison. (a) Existing textual inversion for ZS-CIR. (b) Our fine-grained textual inversion for ZS-CIR.
  • Figure 2: The proposed FTI4CIR consists of two key modules: (a) Fine-grained pseudo-word token mapping and (b) Tri-wise caption-based semantic regularization.
  • Figure 3: An example of a BLIP-generated caption, which can be divided into two parts: $\hat{T}_{subj}$ and $\hat{T}_{attr}$.
  • Figure 4: Sensitivity analysis of our model with respect to the number of latent local attributes $n$. We report the average of R@$10$ and R@$50$ on FashionIQ.
  • Figure 5: Pseudo-to-real description retrieved results. We highlight the related real-word descriptions in green.
  • ...and 1 more figure