Composed Image Retrieval for Training-Free Domain Conversion

Nikos Efthymiadis, Bill Psomas, Zakaria Laskar, Konstantinos Karantzalos, Yannis Avrithis, Ondřej Chum, Giorgos Tolias

TL;DR

FreeDom tackles training-free composed image retrieval in open-world domain conversion by mapping query images into a discrete text vocabulary through memory-based NN textual inversion and retrieving targets via a weighted ensemble of text queries that combine mapped words with the target domain. It leverages a frozen vision-language model (CLIP) and retrieval augmentation with a visual memory to perform robust cross-domain search without training. The approach is validated on multiple domain-conversion benchmarks, showing large gains over prior training-based and training-free CIR methods and demonstrating the effectiveness of discrete word inversion and memory expansion. The work provides a practical, scalable framework with new benchmarks and a strong foundation for future comparisons in domain-conversion CIR and broader composed image retrieval tasks.

Abstract

This work addresses composed image retrieval in the context of domain conversion, where the content of a query image is retrieved in the domain specified by the query text. We show that a strong vision-language model provides sufficient descriptive power without additional training. The query image is mapped to the text input space using textual inversion. Unlike common practice, which inverts in the continuous space of text tokens, we use the discrete word space via a nearest-neighbor search in a text vocabulary. With this inversion, the image is softly mapped across the vocabulary and is made more robust using retrieval-based augmentation. Database images are retrieved by a weighted ensemble of text queries combining mapped words with the domain text. Our method outperforms prior art by a large margin on standard and newly introduced benchmarks. Code: https://github.com/NikosEfth/freedom
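
The retrieval steps summarized in the abstract, with the $k$, $n$, $m$ parameters named in the Figure 1 caption, can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `encode_text` is a hypothetical stand-in for a frozen CLIP text encoder, the prompt template `"{word} {domain}"` is an assumed way of combining a mapped word with the domain text, and normalizing the frequency weights to sum to one is a convenience that does not change the ranking. All embeddings are assumed to be L2-normalized.

```python
import math
from collections import Counter

# Toy stand-in for a CLIP-like text encoder (hypothetical): every known token
# owns one axis of a 5-D space; a prompt is the normalized sum of its tokens.
AXES = {"dog": 0, "cat": 1, "car": 2, "sketch": 3, "photo": 4}

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def encode_text(prompt):
    vec = [0.0] * len(AXES)
    for token in prompt.split():
        vec[AXES[token]] += 1.0
    return normalize(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_indices(query, rows, count):
    """Indices of the `count` rows with the highest dot product to `query`."""
    scores = [dot(query, row) for row in rows]
    return sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)[:count]

def retrieve(query_img, domain, visual_memory, vocabulary, database, k=4, n=3, m=2):
    word_embs = [encode_text(w) for w in vocabulary]  # textual memory
    # 1. Image-to-image search: k proxy images from the visual memory.
    proxies = [visual_memory[i] for i in top_indices(query_img, visual_memory, k)]
    # 2. Image-to-text search: n nearest vocabulary words per proxy image.
    counts = Counter()
    for proxy in proxies:
        for j in top_indices(proxy, word_embs, n):
            counts[vocabulary[j]] += 1
    # 3. Keep the m most frequent words, with their frequencies.
    top_words = counts.most_common(m)
    total = sum(c for _, c in top_words)
    # 4. One text query per word (assumed template), then a frequency-weighted
    #    sum of the text-to-image similarities over the database.
    scores = [0.0] * len(database)
    for word, c in top_words:
        text_query = encode_text(f"{word} {domain}")
        for i, img in enumerate(database):
            scores[i] += (c / total) * dot(text_query, img)
    return scores, top_words

# Toy data: the query is a "dog photo"; the target domain is "sketch".
query = normalize([1, 0, 0, 0, 1])
visual_memory = [normalize(v) for v in (
    [1, 0, 0, 0, 1], [0.9, 0.1, 0, 0, 1], [0.2, 1, 0, 0, 1], [0, 0, 1, 0, 1])]
database = [encode_text("dog sketch"), encode_text("cat sketch"), encode_text("dog photo")]

scores, top_words = retrieve(query, "sketch", visual_memory,
                             ["dog", "cat", "car"], database, k=3, n=1, m=2)
ranking = sorted(range(len(database)), key=lambda i: scores[i], reverse=True)
print(top_words, ranking)
```

In this toy setup the "dog sketch" database item ranks first: the word "dog" is mapped from two of the three proxy images and therefore carries twice the weight of "cat". In practice the embeddings would come from the frozen vision-language model, and the searches would run over much larger memories.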

Paper Structure

This paper contains 58 sections, 16 equations, 9 figures, 19 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of FreeDom. Given a query image and a query text indicating the target domain, proxy images are first retrieved from the query through an image-to-image search over a visual memory. Then, a set of text labels is associated with each proxy image through an image-to-text search over a textual memory. Each of the most frequent text labels is combined with the query text in the text space, and images are retrieved from the database by text-to-image search. The resulting sets of similarities are linearly combined with the frequencies of occurrence as weights. Here: $k=4$ proxy images, $n=3$ text labels per proxy image, $m=2$ most frequent text labels.
  • Figure 2: Histogram of similarities between a query and database images: negative (wrong object and domain); positive only w.r.t. the object (correct object, wrong domain); positive only w.r.t. the domain (wrong object, correct domain); positive (correct object and domain). $E$: early fusion; $L$: late fusion; $L^{+}$: late fusion with memory-based expansion; $L^{+}_{\alpha}$: late fusion with memory-based expansion and frequencies as weights; AP: average precision. For better visualization, we sample an equal number of negatives, positives w.r.t. object, and positives w.r.t. domain, while the values in the histogram of positives are multiplied by 10. MiniDomainNet; text query: "clipart".
  • Figure 3: Different inversions with visual memory expansion and late fusion on ImageNet-R.
  • Figure 4: Impact of the visual memory: performance comparison between no visual memory, the database as visual memory, and visual memories of various sizes comprising LAION-400M images.
  • Figure 5: Performance vs. query time. Different variants of FreeDom are shown by varying hyper-parameter ($k$, $m$, $n$) values and textual memory size. FreeDom uses textual memory with size 20k (default) or 236k (reported).
  • ...and 4 more figures