Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs

Huaying Zhang; Rintaro Yanagi; Ren Togo; Takahiro Ogawa; Miki Haseyama

TL;DR

Applying a masking strategy to abundant, easily obtained image-text pairs lets the method learn the query-target relationship, which is expected to yield accurate zero-shot composed image retrieval (CIR) with a retrieval-focused textual inversion network.

Abstract

This paper proposes a novel zero-shot composed image retrieval (CIR) method that accounts for the query-target relationship by using masked image-text pairs. The objective of CIR is to retrieve a target image given a query image and a query text. Existing methods use a textual inversion network to convert the query image into a pseudo word, compose it with the query text, and perform retrieval with a pre-trained visual-language model. However, they do not consider the query-target relationship when training the textual inversion network, so the network is not trained to acquire information useful for retrieval. In this paper, we propose a novel zero-shot CIR method that is trained end-to-end using masked image-text pairs. By applying a masking strategy to abundant, easily obtained image-text pairs to learn the query-target relationship, the method is expected to realize accurate zero-shot CIR with a retrieval-focused textual inversion network. Experimental results show the effectiveness of the proposed method.
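The composition step described in the abstract (a pseudo word derived from the query image is composed with the query text, and the result is used for retrieval) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the mean-pooling `toy_text_encoder` is a hypothetical stand-in for a pre-trained visual-language text encoder, the pseudo word vector is hard-coded instead of coming from a textual inversion network, and cosine similarity is assumed as the ranking score. It is not the paper's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compose_query(token_embeddings, placeholder_index, pseudo_word):
    """Put the pseudo word embedding in the slot of the placeholder token '*'."""
    composed = list(token_embeddings)
    composed[placeholder_index] = pseudo_word
    return composed


def toy_text_encoder(token_embeddings):
    """Hypothetical stand-in for a text encoder: mean-pool the token embeddings."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def rank_gallery(query_vec, gallery):
    """Return gallery keys sorted by descending cosine similarity to the query."""
    return sorted(gallery, key=lambda k: cosine(query_vec, gallery[k]), reverse=True)


# Toy query text "* red": slot 0 is the placeholder, slot 1 is the word "red".
tokens = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pseudo_word = [1.0, 0.0, 0.0]  # assumed output of a textual inversion network
query_vec = toy_text_encoder(compose_query(tokens, 0, pseudo_word))

# Toy gallery of image embeddings keyed by image id.
gallery = {"a": [1.0, 1.0, 0.0], "b": [0.0, 0.0, 1.0], "c": [1.0, 0.0, 0.0]}
ranking = rank_gallery(query_vec, gallery)
```

In this toy setup, image "a" matches both the pseudo word direction and the text direction, so it ranks first; "c" matches only the pseudo word, and "b" matches neither.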

Paper Structure (11 sections, 12 equations, 3 figures, 3 tables)

Introduction
Zero-shot CIR with Query-target Relationship
Image-text Masking
Query Composition
Loss Calculation
Experiments
Experiment Details
Experimental Results
Additional Experiment
Limitations
Conclusion

Figures (3)

Figure 1: Overview of our proposed method. We use the masked image-text pair as the query and the original image as the target to train the textual inversion network. The image-text masking is performed using a class activation map (CAM) [Chefer_2021_ICCV]. The first noun that appears in the text is masked, and the image region that is not related to that word is replaced by another image from the same batch.
Figure 2: Qualitative results of PM and Pic2Word [Saito_2023_CVPR]. The symbol * indicates the pseudo word generated from the query image. A green frame means the ground-truth image is ranked at the top of the retrieval list, while a red frame means an incorrect image is ranked at the top.
Figure 3: Failure examples of PM. The symbol * indicates the pseudo word generated from the query image. A red frame means the retrieved image is biased toward the text query.
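
The image-text masking described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under several assumptions that do not come from the paper: the text arrives already tagged with Penn Treebank style part-of-speech tags, the CAM is given as a per-pixel relevance score in [0, 1], a fixed `threshold` of 0.5 separates related from unrelated regions, and `[MASK]` is a hypothetical mask token. The function names are illustrative only.

```python
def mask_first_noun(tagged_tokens, mask_token="[MASK]"):
    """Replace the first noun in a POS-tagged token list with mask_token.

    Returns (masked_tokens, noun); noun is None when the text has no noun.
    """
    masked, noun = [], None
    for word, tag in tagged_tokens:
        if noun is None and tag.startswith("NN"):  # assumed Penn Treebank noun tags
            noun = word
            masked.append(mask_token)
        else:
            masked.append(word)
    return masked, noun


def mask_image(image, cam, other_image, threshold=0.5):
    """Keep pixels the CAM relates to the noun; fill the rest from other_image."""
    return [
        [pixel if score >= threshold else other
         for pixel, score, other in zip(img_row, cam_row, other_row)]
        for img_row, cam_row, other_row in zip(image, cam, other_image)
    ]


# Toy caption with hand-written POS tags.
tagged = [("a", "DT"), ("brown", "JJ"), ("dog", "NN"), ("on", "IN"), ("grass", "NN")]
masked_text, noun = mask_first_noun(tagged)

# Toy 2x2 "images" with one value per pixel, plus a CAM for the masked noun.
image = [[1, 2], [3, 4]]
cam = [[0.9, 0.1], [0.2, 0.8]]
other_image = [[9, 9], [9, 9]]  # another image from the same batch
masked_image = mask_image(image, cam, other_image)
```

The masked text and masked image would then serve as the query, with the original image as the retrieval target, as the caption describes.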
