Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval
Haoqiang Lin, Haokun Wen, Xuemeng Song, Meng Liu, Yupeng Hu, Liqiang Nie
TL;DR
This paper tackles zero-shot composed image retrieval (ZS-CIR) by introducing FTI4CIR, which maps each image to a fine-grained set of pseudo-words: one subject-oriented token and several attribute-oriented tokens. A dynamic local attribute feature extraction module and an orthogonality constraint encourage diverse, domain-sensitive local attributes, and a tri-wise caption-based semantic regularization aligns the pseudo-words with real-word embeddings using BLIP-generated captions. At inference, CIR reduces to a pure text-to-image retrieval task: the pseudo-words from the reference image are composed with the modification text into a single text query, which is what makes the zero-shot setting possible. Experiments on FashionIQ, CIRR, and CIRCO show consistent improvements over zero-shot baselines and strong generalization, and ablations confirm the contribution of each component, including the tri-wise regularization. The work advances ZS-CIR by capturing fine-grained image content and aligning it with the textual space, which supports more accurate and flexible user queries across domains.
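The orthogonality constraint mentioned above can be illustrated with a small sketch. The summary does not give the paper's exact formulation, so the penalty below is an assumption: one common form, the squared Frobenius norm of the difference between the Gram matrix of the L2-normalized attribute features and the identity. The function name `orthogonality_penalty` is hypothetical, and the inputs are toy vectors rather than real image features.

```python
import math


def orthogonality_penalty(features):
    """Return ||G - I||_F^2, where G is the Gram matrix of the
    L2-normalized rows of `features`.

    Assumption: this is a generic orthogonality penalty, not
    necessarily the exact loss used in FTI4CIR. Each row must be
    a non-zero vector.
    """
    # Normalize every attribute feature to unit length.
    normed = []
    for row in features:
        norm = math.sqrt(sum(x * x for x in row))
        normed.append([x / norm for x in row])

    # Accumulate squared deviations of the Gram matrix from identity.
    penalty = 0.0
    for i, a in enumerate(normed):
        for j, b in enumerate(normed):
            dot = sum(x * y for x, y in zip(a, b))
            target = 1.0 if i == j else 0.0
            penalty += (dot - target) ** 2
    return penalty


# Two orthogonal attribute features incur no penalty.
orthogonal = orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]])
# Two identical attribute features are penalized.
duplicated = orthogonality_penalty([[1.0, 0.0], [1.0, 0.0]])
```

Under this form, the penalty is zero only when the attribute features are mutually orthogonal, so minimizing it pushes the attribute-oriented tokens toward describing different aspects of the image.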
Abstract
Composed Image Retrieval (CIR) allows users to search for target images with a multimodal query, comprising a reference image and a modification text that describes the user's modification demand over the reference image. Nevertheless, due to the expensive labor cost of training data annotation, researchers have recently shifted to the challenging task of zero-shot CIR (ZS-CIR), which aims to perform CIR without annotated triplets. Pioneering ZS-CIR studies focus on converting the CIR task into a standard text-to-image retrieval task by pre-training a textual inversion network that can map a given image into a single pseudo-word token. Despite their significant progress, their coarse-grained textual inversion may be insufficient to capture the full content of the image accurately. To overcome this issue, in this work, we propose a novel Fine-grained Textual Inversion Network for ZS-CIR, named FTI4CIR. In particular, FTI4CIR comprises two main components: fine-grained pseudo-word token mapping and tri-wise caption-based semantic regularization. The former maps the image into a subject-oriented pseudo-word token and several attribute-oriented pseudo-word tokens to comprehensively express the image in the textual form, while the latter jointly aligns the fine-grained pseudo-word tokens to the real-word token embedding space based on a BLIP-generated image caption template. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our proposed method.
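As a rough illustration of how textual inversion turns CIR into text-to-image retrieval, the sketch below composes a text query from pseudo-word placeholders and a modification text, then ranks gallery images by cosine similarity. Everything here is an assumption made for illustration: the prompt template, the placeholder names `[S*]` and `[A1*]`, the helper names, and the toy 2-D embeddings are hypothetical, and the text and image encoders that would produce real embeddings are not shown.

```python
import math


def compose_query(num_attr_tokens, modification_text):
    """Build a text query from one subject placeholder, several
    attribute placeholders, and the modification text.

    Assumption: the template wording is hypothetical, not the
    paper's actual prompt.
    """
    attrs = " ".join(f"[A{i}*]" for i in range(1, num_attr_tokens + 1))
    return f"a photo of [S*] with {attrs}, {modification_text}"


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query_emb, gallery):
    """Return gallery image names sorted by descending similarity
    to the query embedding."""
    return sorted(
        gallery,
        key=lambda name: cosine(query_emb, gallery[name]),
        reverse=True,
    )


query_text = compose_query(2, "is red")

# Toy embeddings standing in for encoder outputs.
toy_query_emb = [1.0, 0.0]
toy_gallery = {"a": [0.0, 1.0], "b": [1.0, 0.1], "c": [0.5, 0.5]}
ranking = rank_gallery(toy_query_emb, toy_gallery)
```

In a real system, the placeholders would be replaced by the learned pseudo-word token embeddings before the text encoder runs, and the gallery embeddings would come from the matching image encoder.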
