Personalizing Retrieval using Joint Embeddings or "the Return of Fluffy"

Bruno Korbar, Andrew Zisserman

TL;DR

This work addresses personalized retrieval in vision-language models by learning a $\pi$-map that converts a localized image embedding of an object instance into a text token $y^*$, which can then be composed with natural language queries. The method keeps the CLIP image and text encoders frozen and trains a lightweight three-layer MLP to translate image cues into a token that aligns with a detailed caption $y_s$ while remaining distinct from the generic class token $y_g$. Training is guided by a text loss $\mathcal{L}_t$ and an image loss $\mathcal{L}_i$, combined as $\mathcal{L} = (1-\alpha)\mathcal{L}_t + \alpha\mathcal{L}_i$ with $\alpha=0.25$. The object is localised with GroundingDINO, and captions are augmented with an LLM to enrich the training signal. The approach achieves state-of-the-art performance on the this-is-my and DeepFashion2 personalization benchmarks using relatively few example templates per instance. Because it is a plug-and-play image-to-text translation that supports text-augmented and image-based queries, it suits practical retrieval of personal objects in images and videos. The method is also robust across CLIP variants and extends to video via frame sampling, which makes it applicable to personalized retrieval in real-world media libraries.

Abstract

The goal of this paper is to retrieve images using a compound query that combines object instance information from an image with a natural text description of what that object is doing or where it is. An example is retrieving an image of "Fluffy the unicorn (specified by an image) on someone's head". To achieve this, we design a mapping network that can "translate" from a local image embedding (of the object instance) to a text token, such that the combination of the token and a natural language query is suitable for CLIP-style text encoding and image retrieval. Generating a text token in this manner involves a simple training procedure that only needs to be performed once for each object instance. We show that our approach of using a trainable mapping network, termed $\pi$-map, together with frozen CLIP text and image encoders, improves the state of the art on two benchmarks designed to assess personalized retrieval.
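The compound query described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 3-dimensional embeddings, the `*` placeholder, the mean-pooling `toy_text_encoder` (a stand-in for the frozen CLIP text encoder), and the gallery entries are all hypothetical and chosen only to show the idea of swapping a learned instance token into a text query and ranking images by cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def compose_query(token_embeddings, placeholder, instance_token):
    """Replace the placeholder word's embedding with the learned instance token."""
    return [instance_token if word == placeholder else emb
            for word, emb in token_embeddings]


def toy_text_encoder(embeddings):
    """Hypothetical stand-in for a frozen text encoder: mean-pools token embeddings."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def rank_gallery(query_embedding, gallery):
    """Return gallery keys sorted by decreasing cosine similarity to the query."""
    return sorted(gallery,
                  key=lambda name: cosine(query_embedding, gallery[name]),
                  reverse=True)


# Hypothetical learned token for the instance (the role played by y*).
instance_token = [1.0, 0.0, 0.0]

# Query "* on head", with made-up word embeddings.
query_tokens = [("*", [0.0, 0.0, 0.0]),
                ("on", [0.0, 0.0, 0.2]),
                ("head", [0.0, 1.0, 0.0])]

composed = compose_query(query_tokens, "*", instance_token)
query_embedding = toy_text_encoder(composed)

# Made-up image embeddings for three gallery images.
gallery = {
    "fluffy_on_head": [1.0, 1.0, 0.0],
    "fluffy_on_table": [1.0, 0.0, 1.0],
    "cat_on_head": [0.0, 1.0, 1.0],
}

ranking = rank_gallery(query_embedding, gallery)
print(ranking[0])
```

In this toy setup the top-ranked image is the one that matches both the instance and the context, because the pooled query carries a component from the instance token and one from the word "head".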

Paper Structure

This paper contains 23 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Given a few example images of an instance, our $\pi$-map model learns a personalised text embedding for the instance ("My dog Chia"). This text embedding can then be composed with free-form text queries to search amongst a dataset of images or within video frames.
  • Figure 2: (a) Generating a text token, $y^*$, for a specific object instance. The token $y^*$ is obtained by fine-tuning the $\pi$-map given an image $x$ of the instance and a specific text description $y_s$. The $\pi$-map is fine-tuned such that the text embedding of $y^*$ is close to the text embedding of the specific description $y_s$ but away from the text embedding of the generic class description $y_g$. Also, as a regularization, $y^*$ is kept close to the original image embedding. The total loss is a linear combination of the text embedding loss, $\mathcal{L}_t$, and the image embedding loss, $\mathcal{L}_i$. (b) Caption augmentation using an LLM [rekateam2024rekacoreflashedge].
  • Figure 3: Qualitative samples of contextual retrieval from the this-is-my [yeh2023meta] and DeepFashion2 [DeepFashion2] datasets, with results sorted from left to right. Green and red rectangles mark correctly and incorrectly retrieved segments/images. A dotted green line marks a correctly retrieved instance in the wrong setting.
  • Figure 4: Importance of using localised features: learning personalised features for 'My dog Chia' from two different sets of template images. In a) all template images come from the same time and place, while in b) the images are varied. Our method ranks the correct image first on both occasions, while PALAVRA [eccv2022_palavra_cohen] remains sensitive to the diversity of the template images.
  • Figure 5: Examples from our evaluation datasets.
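
The loss combination described in the Figure 2 caption can be sketched as below. Only the weighted sum $\mathcal{L} = (1-\alpha)\mathcal{L}_t + \alpha\mathcal{L}_i$ with $\alpha=0.25$ comes from the summary above. The cosine-based forms of `text_loss` and `image_loss` are assumptions made for illustration, since the exact definitions of $\mathcal{L}_t$ and $\mathcal{L}_i$ are not given here, and the 2-dimensional vectors are made up.

```python
import math

ALPHA = 0.25  # weight on the image loss, as stated in the TL;DR


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def text_loss(t_star, t_specific, t_generic):
    """Assumed form of L_t: pull the embedding of y* toward the specific
    description y_s and penalise positive similarity to the generic class y_g."""
    pull = 1.0 - cosine(t_star, t_specific)
    push = max(0.0, cosine(t_star, t_generic))
    return pull + push


def image_loss(y_star, image_embedding):
    """Assumed form of L_i: keep y* close to the original image embedding."""
    return 1.0 - cosine(y_star, image_embedding)


def total_loss(l_t, l_i, alpha=ALPHA):
    """Weighted combination L = (1 - alpha) * L_t + alpha * L_i."""
    return (1.0 - alpha) * l_t + alpha * l_i


# Made-up embeddings: y* matches the specific caption, is orthogonal to the
# generic class, and is orthogonal to the image embedding.
l_t = text_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
l_i = image_loss([1.0, 0.0], [0.0, 1.0])
loss = total_loss(l_t, l_i)
print(l_t, l_i, loss)
```

With $\alpha=0.25$ the text term carries three times the weight of the image term, which is consistent with the caption's description of the image term as a regularizer.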