Embedding the Teacher: Distilling vLLM Preferences for Scalable Image Retrieval

Eric He, Akash Gupta, Adian Liusie, Vatsal Raina, Piotr Molenda, Shirom Chabra, Vyas Raina

TL;DR

This paper addresses scalable text–image retrieval for persona-driven product recommendation by distilling the ranking behavior of a strong vLLM into an embedding-based retriever. The approach learns a score $s(x,u) = g_{\text{text}}(x; \theta_{\text{text}})^{\top} g_{\text{img}}(u; \theta_{\text{img}})$ between a text query $x$ and an image $u$; the text encoder is kept frozen and the image encoder is fine-tuned so that the score approximates the teacher's preferences. It introduces a Bradley–Terry-based loss with a preference-aligned sampling strategy to transfer teacher rankings without manual labeling. Experiments on OpenCharacter and Nemotron personas across multiple catalogs show consistent gains over FashionCLIP, CLIP, and text-only baselines, demonstrating scalable, personalized retrieval. The framework generalizes beyond persona matching to other abstract preferences and multi-domain catalogs.
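To make the scoring function and the loss concrete, here is a minimal sketch of a pairwise Bradley–Terry objective on dot-product scores. It assumes the standard pairwise form, $-\log \sigma(s(x,u^{+}) - s(x,u^{-}))$, where the teacher prefers image $u^{+}$ over $u^{-}$; the paper's exact loss and its preference-aligned sampling strategy are not given in this summary. The names `score` and `bradley_terry_loss` and the toy vectors are hypothetical.

```python
import math


def score(text_emb, img_emb):
    """Dot-product score s(x, u) between a text and an image embedding."""
    return sum(p * q for p, q in zip(text_emb, img_emb))


def bradley_terry_loss(text_emb, preferred_img, rejected_img):
    """Pairwise Bradley-Terry loss: -log sigmoid(s(x, u+) - s(x, u-)).

    Assumption: the teacher ranked `preferred_img` above `rejected_img`.
    """
    margin = score(text_emb, preferred_img) - score(text_emb, rejected_img)
    # -log sigmoid(m) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy embeddings (illustrative values only, not real model outputs).
text = [1.0, 0.0]
good = [0.9, 0.1]
bad = [0.1, 0.9]

loss_agree = bradley_terry_loss(text, good, bad)     # student agrees with teacher
loss_disagree = bradley_terry_loss(text, bad, good)  # student disagrees
```

The loss is small when the student scores the teacher-preferred image higher and grows as the ordering is violated, so minimizing it over many teacher-ranked pairs pushes the image encoder toward the teacher's ranking.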

Abstract

Text–image retrieval is necessary for applications such as product recommendation. Embedding-based approaches like CLIP enable efficient large-scale retrieval via vector similarity search, but they are primarily trained on literal caption-like text–image pairs and often fail to capture abstract or persona-driven attributes common in product recommendation applications (e.g., "a gift for a mother who loves gardening"). In contrast, state-of-the-art vision–language models (vLLMs) can align text with images in a flexible manner, but their limited context window prevents them from directly handling retrieval over large catalogs. We propose a framework that distills the preference rankings of a powerful vLLM into an embedding-based system, transferring its nuanced alignment abilities while maintaining the inference-time scalability of an embedding-based approach. Experiments on persona-driven product recommendation tasks demonstrate that our method significantly outperforms existing embedding-based baselines, providing an efficient solution for personalized text–image retrieval.
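The inference-time scalability mentioned above comes from scoring a query against image embeddings that were computed once, offline. The sketch below illustrates that idea with a brute-force top-k search over a tiny in-memory catalog; the function name `top_k_images` and the example vectors are hypothetical, and a production system would more likely use an approximate nearest-neighbor index than a linear scan.

```python
import heapq


def top_k_images(text_emb, image_embs, k):
    """Return indices of the k catalog images with the highest dot-product score.

    `image_embs` is assumed to be pre-computed offline, so only the text
    query has to be encoded at request time.
    """
    scored = (
        (sum(p * q for p, q in zip(text_emb, emb)), idx)
        for idx, emb in enumerate(image_embs)
    )
    return [idx for _, idx in heapq.nlargest(k, scored)]


# Toy catalog of three pre-computed image embeddings (illustrative values).
catalog = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
]
query = [1.0, 0.0]

ranked = top_k_images(query, catalog, k=2)
```

Because the catalog side is fixed, adding a new persona or query costs one text-encoder pass plus a similarity search, independent of any context-window limit.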

Paper Structure

This paper contains 32 sections, 5 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Comparison of methods for persona-based product image retrieval. vLLMs capture preferences well but cannot retrieve from a large image catalog because of their context limit; embedding-based CLIP enables retrieval but is not well aligned to product recommendation. Our approach distills vLLM preferences into an embedding-based retriever to achieve both.
  • Figure 2: Efficient inference with pre-computed image embeddings.
  • Figure 3: Preference-aligned distillation pipeline.
  • Figure 4: Top 5 images (in descending order of relevance, left to right) retrieved from the 2,048-image test set of the H&M dataset by our method and the baseline methods, for the prompt "a financial analyst who is skeptical about celebrity wealth estimations" (from OpenCharacter Personas).