Leveraging Large Language Models for Multimodal Search

Oriol Barbany, Michael Huang, Xinliang Zhu, Arnab Dhua

TL;DR

This work presents a comprehensive multimodal search pipeline that combines a novel composed retrieval model with a conversational interface to support image-text queries in fashion. The model fuses CLIP-based image features with a T5 text processor, leverages LoRA adapters, and maps inputs to a shared embedding space trained with an InfoNCE retrieval loss plus a language modeling loss, achieving state-of-the-art results on Fashion200K (R@10=71.4, R@50=91.6, Avg=81.5). The conversational interface, inspired by Visual ChatGPT, uses a prompt manager to orchestrate tools and enable natural-language interaction that leverages previous queries (RAG-style context), bridging unimodal and multimodal search. Quantitative and qualitative results demonstrate strong retrieval performance and practical, human-like shopping-assistant behavior, while limitations point to generalization challenges and memory/prompt-length constraints as avenues for future work.

Abstract

Multimodal search has become increasingly important in providing users with a natural and effective way to express their search intentions. Images offer fine-grained details of the desired products, while text allows for easily incorporating search modifications. However, some existing multimodal search systems are unreliable and fail to address simple queries. The problem becomes harder with the large variability of natural language text queries, which may contain ambiguous, implicit, and irrelevant information. Addressing these issues may require systems with enhanced matching capabilities, reasoning abilities, and context-aware query parsing and rewriting. This paper introduces a novel multimodal search model that achieves a new performance milestone on the Fashion200K dataset. Additionally, we propose a novel search interface integrating Large Language Models (LLMs) to facilitate natural language interaction. This interface routes queries to search systems while conversationally engaging with users and considering previous searches. When coupled with our multimodal search model, it heralds a new era of shopping assistants capable of offering human-like interaction and enhancing the overall search experience.

Paper Structure (12 sections, 5 equations, 4 figures, 1 table)

This paper contains 12 sections, 5 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Overview: This paper introduces a comprehensive pipeline for multimodal search, presenting a novel composed retrieval model that significantly outperforms previous approaches. Additionally, we propose a system that uses a Large Language Model (LLM) as an orchestrator to invoke both our proposed model and other off-the-shelf models. The resulting search interface offers a conversational search assistant experience, integrating information from previous queries and leveraging our novel model to enhance search capabilities.
  • Figure 2: Proposed architecture: We extract visual features from the reference image $\mathbf{x}_{\text{ref}}$ using a Vision Transformer [dosovitskiy2020vit], specifically a pretrained CLIP [clip] model with frozen weights. We extract features before the projection layer, which are then processed by a Querying Transformer (Q-Former), which performs cross-attention with a set of learned queries. The output of the Q-Former is concatenated with the embeddings obtained from the modifying text ($\mathbf{t}$), which expresses a modification to the reference image. Subsequently, all this information is fed into a T5 model [t5], an encoder-decoder LLM. We employ LoRA [hu_lora_2021] to learn low-rank updates for the query and value matrices in all attention layers, while keeping the rest of the parameters frozen. The output of the LLM yields a probability distribution from which a sentence is generated. To ensure alignment with the target caption (i.e., the caption of the target image $\mathbf{x}_{\text{trg}}$, which corresponds to the caption of the reference image after incorporating the text modifications), a language modeling loss is used. The hidden states of the LLM are then projected into an embedding space used for retrieval. A retrieval loss term pushes together the embedding of the target image $\mathcal{G}(\mathbf{x}_{\text{trg}})$ and that obtained using the reference image and the modifying text $\mathcal{F}(\mathbf{x}_{\text{ref}}, \mathbf{t})$.
  • Figure 3: Qualitative results: Example queries from the Fashion200K dataset [fashion200k] and their 4 best matches. Correct matches are shown in green and incorrect ones in red. In the successful examples, our proposal incorporates modifications to the input product involving changes to color and material, among others. In the failure examples, the correct products are not retrieved, yet almost all the retrieved images satisfy the search criteria.
  • Figure 4: Proposed conversational multimodal search system: In this example, the user uploads an image from the Fashion200K dataset [fashion200k] and provides text input intending to search for a dress similar to the product in the image but in a different color. An LLM, specifically GPT-3 [gpt3], processes the user's prompt and invokes our novel multimodal search model with the uploaded image and a formatted text query. The desired attribute indicated by the user is "beige", which can be inferred from the text input. The original attribute is required by the prompt used during the training of our model and is correctly identified by the LLM as "gray". In this case, the LLM can obtain this information from the product descriptions of the first matches returned by an image search with the uploaded picture. The conversational nature of the interactions with the user offers an improved search experience.
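
The retrieval loss named in the TL;DR (InfoNCE) and described in the Figure 2 caption can be illustrated with a minimal in-batch sketch. This is not the authors' implementation: it uses NumPy rather than a deep learning framework, the function name `info_nce_loss` is hypothetical, and the temperature value of 0.07 is an assumption, not a figure from the paper. Row i of `query_emb` stands in for $\mathcal{F}(\mathbf{x}_{\text{ref}}, \mathbf{t})$ and row i of `target_emb` for $\mathcal{G}(\mathbf{x}_{\text{trg}})$ of the same training example; the other rows in the batch act as negatives.

```python
import numpy as np


def info_nce_loss(query_emb, target_emb, temperature=0.07):
    """In-batch InfoNCE: row i of query_emb should match row i of target_emb.

    The temperature default is an illustrative assumption.
    """
    # L2-normalise so that dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)

    # (B, B) similarity matrix; entry (i, j) compares query i with target j.
    logits = q @ t.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive pair for query i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


# Toy check with random embeddings (batch of 4, dimension 8).
rng = np.random.default_rng(0)
targets = rng.normal(size=(4, 8))
aligned_loss = info_nce_loss(targets + 0.01 * rng.normal(size=(4, 8)), targets)
shuffled_loss = info_nce_loss(targets[::-1], targets)
print(aligned_loss, shuffled_loss)
```

In this toy check, queries that nearly coincide with their own targets give a much lower loss than queries paired with the wrong targets, which is the "pushes together" behavior the caption describes. In the full model this term would be added to the language modeling loss; how the two terms are weighted is not stated in this excerpt.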