Personalizing Retrieval using Joint Embeddings or "the Return of Fluffy"
Bruno Korbar, Andrew Zisserman
TL;DR
This work addresses personalized retrieval in vision-language models by learning a mapping network, pi-map, that converts a localized image embedding of an object instance into a text token $y^*$, which can be composed with natural language queries. The CLIP image and text encoders stay frozen; only a lightweight three-layer MLP is trained, so that the token aligns with a detailed caption $y_s$ while remaining distinct from the generic class token $y_g$. Training is guided by two losses, $\mathcal{L}_t$ and $\mathcal{L}_i$, combined as $\mathcal{L} = (1-\alpha)\mathcal{L}_t + \alpha\mathcal{L}_i$ with $\alpha=0.25$. The object is localized with GroundingDINO, and captions are augmented with an LLM to enrich the training signal. The method achieves state-of-the-art performance on the this-is-my and DeepFashion2 personalization benchmarks with relatively few example templates per instance. Because the image-to-text translation is plug-and-play and supports both text-augmented and image-based queries, it suits practical retrieval of personal objects in images and videos. The method is also robust across CLIP variants and extends to video via frame sampling, making it applicable to personalized retrieval in real-world media libraries.
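As a small illustration of the weighting $\mathcal{L} = (1-\alpha)\mathcal{L}_t + \alpha\mathcal{L}_i$, the sketch below combines two scalar loss values. The function name `combined_loss` and the example loss values are hypothetical; the definitions of $\mathcal{L}_t$ and $\mathcal{L}_i$ themselves are not reproduced here, and only the weighted sum with $\alpha=0.25$ comes from the summary above.

```python
def combined_loss(loss_t: float, loss_i: float, alpha: float = 0.25) -> float:
    """Weighted sum L = (1 - alpha) * L_t + alpha * L_i.

    loss_t and loss_i stand in for already-computed values of the two
    loss terms; how they are computed is outside this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * loss_t + alpha * loss_i


# Hypothetical loss values, chosen only to show the arithmetic.
total = combined_loss(loss_t=0.8, loss_i=0.4)
print(total)
```

With $\alpha=0.25$, the $\mathcal{L}_t$ term carries three times the weight of the $\mathcal{L}_i$ term.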
Abstract
The goal of this paper is to retrieve images using a compound query that combines object instance information from an image with a natural text description of what that object is doing or where it is; for example, to retrieve an image of "Fluffy the unicorn (specified by an image) on someone's head". To achieve this, we design a mapping network that can "translate" from a local image embedding (of the object instance) to a text token, such that the combination of the token and a natural language query is suitable for CLIP-style text encoding and image retrieval. Generating a text token in this manner involves a simple training procedure that only needs to be performed once for each object instance. We show that our approach of using a trainable mapping network, termed pi-map, together with frozen CLIP text and image encoders, improves the state of the art on two benchmarks designed to assess personalized retrieval.
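The retrieval step implied here can be sketched as ranking gallery images by cosine similarity to the embedding of the compound query. This is a minimal sketch under stated assumptions: the three-dimensional vectors are toy stand-ins for CLIP embeddings, the file names are invented, and `rank_gallery` is a hypothetical helper. In the actual setting, the query embedding would come from the frozen text encoder applied to a query containing the learned token, and the gallery embeddings from the frozen image encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_gallery(query_emb, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    return sorted(
        gallery,
        key=lambda name: cosine(query_emb, gallery[name]),
        reverse=True,
    )


# Toy stand-in for the text embedding of a compound query such as
# "<learned token> on someone's head" (values are invented).
query_emb = [0.9, 0.1, 0.3]

# Toy stand-ins for image embeddings of a small gallery (values are invented).
gallery = {
    "fluffy_on_head.jpg": [0.8, 0.2, 0.4],
    "fluffy_on_shelf.jpg": [0.1, 0.9, 0.2],
    "other_toy.jpg": [-0.5, 0.3, 0.1],
}

ranking = rank_gallery(query_emb, gallery)
print(ranking)
```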
