A Multimodal Approach for Cross-Domain Image Retrieval

Lucas Iijima, Nikolaos Giakoumoglou, Tania Stathaki

TL;DR

This paper introduces a novel unsupervised approach to CDIR that incorporates textual context by leveraging pre-trained vision-language models, enabling effective cross-domain similarity computation without the need for labeled data or fine-tuning.

Abstract

Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision, aiming to match images across different visual domains such as sketches, paintings, and photographs. Traditional approaches focus on visual image features and rely heavily on supervised learning with labeled data and cross-domain correspondences, and they often struggle with the significant domain gap. This paper introduces a novel unsupervised approach to CDIR that incorporates textual context by leveraging pre-trained vision-language models. Our method, dubbed Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation, enabling effective cross-domain similarity computation without the need for labeled data or fine-tuning. We evaluate our method on standard CDIR benchmark datasets, demonstrating state-of-the-art performance in unsupervised settings with improvements of 24.0% on Office-Home and 132.2% on DomainNet over previous methods. We also demonstrate our method's effectiveness on a dataset of AI-generated images from Midjourney, showcasing its ability to handle complex, multi-domain queries.

Paper Structure (17 sections, 2 equations, 4 figures, 1 table)


Figures (4)

  • Figure 1: Overview of the proposed Caption-Matching (CM) method. The process begins with the BLIP-2 model $g$ generating captions for each image in the database, where $g(x_i) = t_i$ produces textual captions $t_i$ from images $x_i$. These captions are then processed through the text encoder of the CLIP model to obtain text embeddings $u_i = f_{\text{text}}(t_i)$. Simultaneously, the query image $x_q$ is encoded by the image encoder of CLIP $f_{\text{image}}$ to produce a visual embedding $v_q = f_{\text{image}}(x_q)$. A dot-product similarity score $S_{iq} = v_q \cdot u_i$ is calculated between $v_q$ and each $u_i$, resulting in a list of similarity scores $s_q$. The relevance of each caption to the query image is ranked by sorting $s_q$ and the corresponding images can be retrieved based on the highest scores.
  • Figure 2: t-SNE visualization of text embeddings generated by CLIP's text encoder, illustrating the clustering capability across different domains. Each caption follows the format "Domain of a Class", such as "painting of a bird", showing how semantic similarities are preserved while domain-specific identifiers are abstracted away, which facilitates domain-agnostic categorization.
  • Figure 3: Qualitative comparison of the top-10 retrieved results. From top to bottom, the domain pairs are C-S (DomainNet), Cl-Rw (Office-Home), and Cl-Ar (Office-Home). Correctly retrieved images are outlined in green, while incorrectly retrieved images are outlined in red.
  • Figure 4: Top-5 retrieval results on Midjourney’s database. Descriptions were generated by BLIP-2 during the application of the CM method.
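
The ranking step described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the embeddings have already been computed: the rows of `U` stand in for the caption embeddings $u_i = f_{\text{text}}(t_i)$ and `v_q` stands in for the query embedding $v_q = f_{\text{image}}(x_q)$. The function name `rank_captions`, the `top_k` parameter, and the toy vectors are hypothetical and are not taken from the paper; no BLIP-2 or CLIP model is loaded here.

```python
import numpy as np


def rank_captions(v_q: np.ndarray, U: np.ndarray, top_k: int = 5):
    """Rank database items by dot-product similarity to a query embedding.

    v_q: shape (d,), stand-in for the query image embedding f_image(x_q).
    U:   shape (n, d), row i is a stand-in for the caption embedding u_i.
    Returns the indices of the top_k items and their scores, best first.
    """
    # s_q[i] = v_q . u_i, the similarity score S_iq from the caption.
    s_q = U @ v_q
    # Sort scores in descending order; a stable sort keeps ties in index order.
    order = np.argsort(-s_q, kind="stable")[:top_k]
    return order, s_q[order]


# Toy embeddings (assumed values, for illustration only).
U = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
])
v_q = np.array([0.0, 1.0, 0.0])

indices, scores = rank_captions(v_q, U, top_k=2)
print(indices, scores)
```

In this toy case the second and third database items score highest, so their images would be returned first. Whether the embeddings are L2-normalized before the dot product is not stated in the caption, so the sketch applies the dot product to the vectors as given.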