Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express
Cherag Aroraa, Tracy Holloway King, Jayant Kumar, Yi Lu, Sanat Sharma, Arvind Srikantan, David Uvalle, Josep Valls-Vargas, Harsha Vardhan
TL;DR
This paper tackles multi-modal search for Adobe Express templates by integrating text, images, and contextual signals within a production Elasticsearch-based pipeline through an iterative AB-testing framework. It combines dense and sparse embeddings from domain-specific models (AdobeCLIP) with a symbolic Knowledge Graph (CKG) and a domain-focused MM-CKG loss (SupCoLA) to improve recall, ranking, and long-tail query handling. Key contributions include sparse augmentation of the initial match, re-ranking with external and domain-specific image-text models, symbolic intent recovery for null-heavy queries, and long-query improvements via MM-CKG, all leading to significant CTR gains and null-rate reductions. The work provides practical guidance for deploying robust, latency-aware multi-modal search systems that improve relevance for complex, multi-modal templates in real-world production settings.
Abstract
As user content and queries become increasingly multi-modal, the need for effective multi-modal search systems has grown. Traditional search systems often rely on textual and metadata annotations for indexed images, while multi-modal embeddings like CLIP enable direct search using text and image embeddings. However, embedding-based approaches face challenges in integrating contextual features such as user locale and recency. Building a scalable multi-modal search system requires fine-tuning several components. This paper presents a multi-modal search architecture and a series of AB tests that optimize embeddings and multi-modal technologies in Adobe Express template search. We address considerations such as embedding model selection, the roles of embeddings in matching and ranking, and the balance between dense and sparse embeddings. Our iterative approach demonstrates how utilizing sparse, dense, and contextual features enhances short and long query search, significantly reduces null rates (over 70%), and increases click-through rates (CTR). Our findings provide insights into developing robust multi-modal search systems, thereby enhancing relevance for complex queries.
