Revisiting Relevance Feedback for CLIP-based Interactive Image Retrieval

Ryoya Nara, Yu-Chieh Lin, Yuji Nozawa, Youyang Ng, Goh Itoh, Osamu Torii, Yusuke Matsui

TL;DR

This work addresses two limitations of metric learning for image retrieval, its inability to handle differing user preferences and its need for training data, by introducing CLIP-based interactive retrieval with relevance feedback. The method uses binary user feedback to adapt retrieval without retraining image encoders, leveraging CLIP's zero-shot capabilities to achieve competitive accuracy on category-based tasks and improvements in one-label and conditioned retrieval settings. A simple online update via a 1-NN classifier over the collected feedback refines the second retrieval pass, so that user-specific preferences are captured efficiently. The approach demonstrates that combining CLIP with classic relevance feedback can yield strong performance with realistic feedback sizes and offers a practical baseline for interactive, preference-aware image search without heavy retraining.

Abstract

Many image retrieval studies use metric learning to train an image encoder. However, metric learning cannot handle differences in users' preferences, and requires data to train an image encoder. To overcome these limitations, we revisit relevance feedback, a classic technique for interactive retrieval systems, and propose an interactive CLIP-based image retrieval system with relevance feedback. Our retrieval system first executes the retrieval, collects each user's unique preferences through binary feedback, and returns images the user prefers. Even when users have various preferences, our retrieval system learns each user's preference through the feedback and adapts to the preference. Moreover, our retrieval system leverages CLIP's zero-shot transferability and achieves high accuracy without training. We empirically show that our retrieval system competes well with state-of-the-art metric learning in category-based image retrieval, despite not training image encoders specifically for each dataset. Furthermore, we set up two additional experimental settings where users have various preferences: one-label-based image retrieval and conditioned image retrieval. In both cases, our retrieval system effectively adapts to each user's preferences, resulting in improved accuracy compared to image retrieval without feedback. Overall, our work highlights the potential benefits of integrating CLIP with classic relevance feedback techniques to enhance image retrieval.
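
The two-pass loop described above (retrieve, collect binary feedback, retrieve again) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes image and query embeddings have already been computed (for example by a CLIP encoder) and L2-normalized, and the function names `first_retrieval` and `second_retrieval`, the feedback size `m`, and the rule "predicted-positive images first, then by query similarity" are assumptions made for the example.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row vector to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def first_retrieval(query, gallery, m):
    """Return indices of the m gallery embeddings most similar to the query."""
    sims = gallery @ query
    return np.argsort(-sims)[:m]


def second_retrieval(query, gallery, fb_idx, fb_labels, k):
    """Re-rank the gallery using a 1-NN classifier fit on the feedback.

    fb_idx:    indices of the images the user judged
    fb_labels: boolean array, True where the user marked the image relevant
    Assumed ranking rule: images whose nearest feedback sample is positive
    come first; ties within each group are broken by query similarity.
    """
    fb_emb = gallery[fb_idx]                    # (M, D) feedback embeddings
    nearest = np.argmax(gallery @ fb_emb.T, 1)  # nearest feedback sample per image
    predicted_pos = fb_labels[nearest]          # 1-NN prediction per image
    query_sims = gallery @ query
    # np.lexsort treats the LAST key as primary: predicted positives (0) first,
    # then descending query similarity within each group.
    order = np.lexsort((-query_sims, (~predicted_pos).astype(int)))
    return order[:k]


# Toy usage with hand-made 2-D "embeddings" standing in for CLIP features.
gallery = l2_normalize(np.array([[1.0, 0.0], [0.98, 0.2], [0.0, 1.0], [0.2, 0.98]]))
query = np.array([1.0, 0.0])
shown = first_retrieval(query, gallery, m=2)   # images presented to the user
feedback = np.array([False, True])             # user rejects the first, accepts the second
ranked = second_retrieval(query, gallery, shown, feedback, k=4)
```

Because the classifier is a nearest-neighbor lookup over a handful of feedback embeddings, nothing is trained and the image encoder is left untouched, which is the property the abstract emphasizes.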

Paper Structure

This paper contains 26 sections, 10 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Overview of our proposed method.
  • Figure 2: Evaluation of retrieval algorithm with relevance feedback. We omit the details of the second retrieval process described in Figure 1.
  • Figure 3: An example of one-label-based image retrieval in COCO.
  • Figure 4: An example of conditioned image retrieval.
  • Figure 5: Comparison among various kinds of CLIP and $M$.
  • ...and 5 more figures