Leveraging Large Language Models for Multimodal Search
Oriol Barbany, Michael Huang, Xinliang Zhu, Arnab Dhua
TL;DR
This work presents a multimodal search pipeline that combines a novel composed retrieval model with a conversational interface to support image-text queries in fashion. The model fuses CLIP-based image features with a T5 text processor, uses LoRA adapters, and maps inputs to a shared embedding space trained with an InfoNCE retrieval loss plus a language modeling loss, achieving state-of-the-art results on Fashion200K (R@10 = 71.4, R@50 = 91.6, Avg = 81.5). The conversational interface, inspired by Visual ChatGPT, uses a prompt manager to orchestrate tools and enable natural-language interaction that draws on previous queries (RAG-style context), bridging unimodal and multimodal search. Quantitative and qualitative results demonstrate strong retrieval performance and practical, human-like shopping-assistant behavior, while the limitations point to generalization challenges and memory and prompt-length constraints as avenues for future work.
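As a rough illustration of the InfoNCE retrieval loss mentioned above, the sketch below computes it over a small batch in which the i-th query embedding is paired with the i-th target embedding and every other target in the batch acts as a negative. The function name `info_nce_loss`, the cosine-similarity scoring, and the temperature value of 0.07 are assumptions made for illustration, not details taken from the paper, and the language modeling term is omitted.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce_loss(queries, targets, temperature=0.07):
    """Mean InfoNCE loss over a batch (hypothetical helper, not the paper's code).

    queries[i] is the embedding of the i-th composed (image + text) query and
    targets[i] the embedding of its matching product image. The temperature
    default is an assumed value.
    """
    q = [_normalize(v) for v in queries]
    t = [_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Similarity of query i to every target in the batch.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the matching target.
        total += log_denominator - logits[i]
    return total / len(q)
```

The loss approaches zero when each query is far closer to its own target than to the other targets in the batch, and it grows as the matching target loses ground to the negatives.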
Abstract
Multimodal search has become increasingly important in providing users with a natural and effective way to express their search intentions. Images offer fine-grained details of the desired products, while text allows for easily incorporating search modifications. However, some existing multimodal search systems are unreliable and fail to address simple queries. The problem becomes harder with the large variability of natural language text queries, which may contain ambiguous, implicit, and irrelevant information. Addressing these issues may require systems with enhanced matching capabilities, reasoning abilities, and context-aware query parsing and rewriting. This paper introduces a novel multimodal search model that achieves a new performance milestone on the Fashion200K dataset. Additionally, we propose a novel search interface integrating Large Language Models (LLMs) to facilitate natural language interaction. This interface routes queries to search systems while conversationally engaging with users and considering previous searches. When coupled with our multimodal search model, it heralds a new era of shopping assistants capable of offering human-like interaction and enhancing the overall search experience.
