DashCLIP: Leveraging multimodal models for generating semantic embeddings for DoorDash

Omkar Gurjar; Kin Sum Liu; Praveen Kolli; Utsaw Kumar; Mandar Rahurkar

DashCLIP: Leveraging multimodal models for generating semantic embeddings for DoorDash

Omkar Gurjar, Kin Sum Liu, Praveen Kolli, Utsaw Kumar, Mandar Rahurkar

TL;DR

DashCLIP tackles the challenge of producing high-quality semantic embeddings for e-commerce products and user queries without relying on engagement history. It introduces a two-stage training framework that first continually pre-trains product encoders on the catalog and then aligns product and query embeddings via a Product-Query contrastive loss using an LLM-curated relevance dataset. The approach achieves strong generalization across retrieval, relevance prediction, and ads ranking, with notable improvements in offline CTR/AUC and positive online deployment metrics. These results demonstrate the practical impact of domain-adapted, multimodal embeddings for search and personalized advertising in large-scale e-commerce platforms.

Abstract

Despite the success of vision-language models in various generative tasks, obtaining high-quality semantic representations for products and user intents is still challenging due to the inability of off-the-shelf models to capture nuanced relationships between the entities. In this paper, we introduce a joint training framework for product and user queries by aligning uni-modal and multi-modal encoders through contrastive learning on image-text data. Our novel approach trains a query encoder with an LLM-curated relevance dataset, eliminating the reliance on engagement history. These embeddings demonstrate strong generalization capabilities and improve performance across applications, including product categorization and relevance prediction. For personalized ads recommendation, a significant uplift in the click-through rate and conversion rate after the deployment further confirms the impact on key business metrics. We believe that the flexibility of our framework makes it a promising solution toward enriching the user experience across the e-commerce landscape.

DashCLIP: Leveraging multimodal models for generating semantic embeddings for DoorDash

TL;DR

Abstract

DashCLIP: Leveraging multimodal models for generating semantic embeddings for DoorDash

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (5)