A Multimodal Approach for Cross-Domain Image Retrieval

Lucas Iijima; Nikolaos Giakoumoglou; Tania Stathaki

TL;DR

This paper introduces a novel unsupervised approach to Cross-Domain Image Retrieval (CDIR) that incorporates textual context by leveraging pre-trained vision-language models, enabling effective cross-domain similarity computation without the need for labeled data or fine-tuning.

Abstract

Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision that aims to match images across different visual domains such as sketches, paintings, and photographs. Traditional approaches focus on visual image features and rely heavily on supervised learning with labeled data and cross-domain correspondences, and they often struggle with the significant domain gap. This paper introduces a novel unsupervised approach to CDIR that incorporates textual context by leveraging pre-trained vision-language models. Our method, dubbed Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation, enabling effective cross-domain similarity computation without the need for labeled data or fine-tuning. We evaluate our method on standard CDIR benchmark datasets, demonstrating state-of-the-art performance in unsupervised settings with improvements of 24.0% on Office-Home and 132.2% on DomainNet over previous methods. We also demonstrate our method's effectiveness on a dataset of AI-generated images from Midjourney, showcasing its ability to handle complex, multi-domain queries.

Paper Structure (17 sections, 2 equations, 4 figures, 1 table)

Introduction
Related Work
Image Retrieval
Cross-Domain Image Retrieval
Vision-Language Foundation Models
Differentiation from State-of-the-Art
Methodology
Problem Statement
Preliminaries
The Caption-Matching Method
Experiments
Implementation Details
Results
A Class Oracle-based Analysis
Qualitative Evaluation with Multi-Domain Dataset
...and 2 more sections

Figures (4)

Figure 1: Overview of the proposed Caption-Matching (CM) method. The process begins with the BLIP-2 model $g$ generating captions for each image in the database, where $g(x_i) = t_i$ produces textual captions $t_i$ from images $x_i$. These captions are then processed through the text encoder of the CLIP model to obtain text embeddings $u_i = f_{\text{text}}(t_i)$. Simultaneously, the query image $x_q$ is encoded by the image encoder of CLIP $f_{\text{image}}$ to produce a visual embedding $v_q = f_{\text{image}}(x_q)$. A dot-product similarity score $S_{iq} = v_q \cdot u_i$ is calculated between $v_q$ and each $u_i$, resulting in a list of similarity scores $s_q$. The relevance of each caption to the query image is ranked by sorting $s_q$ and the corresponding images can be retrieved based on the highest scores.
Figure 2: t-SNE visualization of text embeddings generated by CLIP's text encoder, illustrating the clustering capability across different domains. Each caption follows the format "Domain of a Class", such as "painting of a bird", showing how semantic similarities are preserved while domain-specific identifiers are abstracted away, which facilitates domain-agnostic categorization.
Figure 3: Qualitative comparison of the top-10 retrieved results. From top to bottom, the domain pairs are C-S (DomainNet), Cl-Rw (Office-Home) and Cl-Ar (Office-Home). Correctly retrieved images are outlined in green, while incorrectly retrieved images are outlined in red.
Figure 4: Top-5 retrieval results on Midjourney’s database. Descriptions were generated by BLIP-2 during the application of the CM method.
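
The ranking step described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the caption embeddings $u_i$ and the query embedding $v_q$ have already been computed; the captioning and encoding stages (BLIP-2 and CLIP) are not shown. The function name `rank_by_caption_similarity`, the `top_k` parameter, and the toy three-dimensional vectors are hypothetical and chosen only for illustration; they are not taken from the paper's code.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_caption_similarity(
    v_q: Sequence[float],
    caption_embeddings: Sequence[Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (database index, score) pairs sorted by descending score.

    v_q stands in for the query image embedding and caption_embeddings[i]
    for the text embedding u_i of the caption generated for image x_i.
    """
    # s_q[i] = v_q . u_i, one score per database caption
    s_q = [dot(v_q, u_i) for u_i in caption_embeddings]
    # Sort database indices by score, highest first
    order = sorted(range(len(s_q)), key=lambda i: s_q[i], reverse=True)
    return [(i, s_q[i]) for i in order[:top_k]]


# Toy stand-ins for real embeddings (assumption: unit-length vectors)
caption_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
v_q = [0.8, 0.6, 0.0]

# Caption 2 ranks first, followed by caption 0
result = rank_by_caption_similarity(v_q, caption_embeddings, top_k=2)
```

In this sketch the retrieved images would be the database images whose indices appear in `result`, since each caption embedding corresponds to exactly one database image.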
