DashCLIP: Leveraging multimodal models for generating semantic embeddings for DoorDash

Omkar Gurjar, Kin Sum Liu, Praveen Kolli, Utsaw Kumar, Mandar Rahurkar

TL;DR

DashCLIP tackles the challenge of producing high-quality semantic embeddings for e-commerce products and user queries without relying on engagement history. It introduces a two-stage training framework that first continually pre-trains product encoders on the catalog and then aligns product and query embeddings via a Product-Query contrastive loss using an LLM-curated relevance dataset. The approach achieves strong generalization across retrieval, relevance prediction, and ads ranking, with notable improvements in offline CTR/AUC and positive online deployment metrics. These results demonstrate the practical impact of domain-adapted, multimodal embeddings for search and personalized advertising in large-scale e-commerce platforms.

Abstract

Despite the success of vision-language models in various generative tasks, obtaining high-quality semantic representations for products and user intents is still challenging due to the inability of off-the-shelf models to capture nuanced relationships between the entities. In this paper, we introduce a joint training framework for product and user queries by aligning uni-modal and multi-modal encoders through contrastive learning on image-text data. Our novel approach trains a query encoder with an LLM-curated relevance dataset, eliminating the reliance on engagement history. These embeddings demonstrate strong generalization capabilities and improve performance across applications, including product categorization and relevance prediction. For personalized ads recommendation, a significant uplift in the click-through rate and conversion rate after the deployment further confirms the impact on key business metrics. We believe that the flexibility of our framework makes it a promising solution toward enriching the user experience across the e-commerce landscape.
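
The contrastive alignment between queries and products described above can be sketched in a few lines. What follows is a minimal illustration of a symmetric in-batch contrastive loss, not the authors' implementation: the function names, the temperature of 0.07, and the convention that row i of each batch forms a relevant query-product pair are assumptions made here, and the paper's exact PQC formulation and sampling may differ.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(query_embs, product_embs, temperature=0.07):
    """In-batch contrastive loss averaged over both matching directions.

    Assumption: query_embs[i] and product_embs[i] are a relevant pair, and
    every other product in the batch acts as a negative for query i.
    """
    queries = [l2_normalize(v) for v in query_embs]
    products = [l2_normalize(v) for v in product_embs]
    # Temperature-scaled cosine similarity between every query and product.
    logits = [
        [sum(a * b for a, b in zip(q, p)) / temperature for p in products]
        for q in queries
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    query_to_product = cross_entropy_diag(logits)
    product_to_query = cross_entropy_diag(logits_transposed)
    return 0.5 * (query_to_product + product_to_query)
```

The loss is small when each query is most similar to its own product and large when it is more similar to another product in the batch, which is the pressure that pulls relevant pairs together in the shared embedding space.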

Paper Structure

This paper contains 26 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Model Architecture and Training Objectives of DashCLIP. Training proceeds in two stages. Stage 1 (colored blue): the image and text uni-modal encoders are trained using the Image-Text contrastive (ITC) loss, and the multi-modal image-text encoder is trained using the Image-Text matching (ITM) loss. Stage 2 (colored green): the multi-modal projection layers and the query encoder are trained using the Product-Query contrastive (PQC) loss. Dotted lines represent shared weights.
  • Figure 2: Sampling strategy from the Query Product Relevance dataset for training the PQC objective. The numbers represent the relative frequency ratios between the respective relevance types of queries.
  • Figure 3: Model architecture for integrating the embedding features with the existing pCTR model. The outputs of the two towers are concatenated and then passed through fully-connected layers to obtain the final click probability.
  • Figure 4: Scatter plot of product embeddings after t-SNE dimensionality reduction for the top-10 aisle categories by frequency. Products from the same category naturally form clusters. Similar clusters, such as Drinks and Alcohol, lie closer to each other. Clusters of distinct categories, such as Pet Care, are isolated from the main mass.
  • Figure 5: Distribution of cosine similarity between product and query embeddings from off-the-shelf BLIP-14M (top) and DashCLIP (bottom). The DashCLIP embeddings clearly separate the three relevance classes, showing the effectiveness of the PQC loss.
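
The fusion step in the Figure 3 caption (concatenate the two tower outputs, then apply fully-connected layers to get a click probability) can be illustrated with a small forward pass. This is a hedged sketch only: the caption does not say what each tower encodes or how large the layers are, so the names `tower_a_out` and `tower_b_out`, the layer sizes, the ReLU activations, and the random weights are all hypothetical.

```python
import math
import random


def dense(x, weights, bias):
    """Fully-connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def init_layer(rng, n_in, n_out):
    """Hypothetical random initialization; a real model would learn these weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_click_probability(tower_a_out, tower_b_out, layers):
    """Concatenate two tower outputs and run them through fully-connected layers.

    `layers` is a list of (weights, bias) pairs; the last layer must have a
    single output unit, which is squashed to a probability with a sigmoid.
    """
    x = list(tower_a_out) + list(tower_b_out)  # concatenation step
    for weights, bias in layers[:-1]:
        x = relu(dense(x, weights, bias))
    weights, bias = layers[-1]
    return sigmoid(dense(x, weights, bias)[0])


# Example wiring with made-up sizes: two 4-dimensional tower outputs,
# one hidden layer of 4 units, and a single output unit.
rng = random.Random(0)
example_layers = [init_layer(rng, 8, 4), init_layer(rng, 4, 1)]
example_probability = predict_click_probability([0.1] * 4, [0.2] * 4, example_layers)
```

Concatenation lets the fully-connected layers learn interactions between the two towers' features, which a plain dot product between tower outputs would not capture.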