PinCLIP: Large-scale Foundational Multimodal Representation at Pinterest

Josh Beal, Eric Kim, Jinfeng Rao, Rex Wu, Dmitry Kislyuk, Charles Rosenberg

TL;DR

PinCLIP is a hybrid Vision Transformer architecture that combines a VLM backbone with a hybrid fusion mechanism to capture multi-modal content representations at varying granularities. It learns image-text alignment to enhance retrieval and ranking models at Pinterest.

Abstract

While multi-modal Visual Language Models (VLMs) have demonstrated significant success across various domains, integrating them into recommendation and retrieval systems remains a challenge because of issues such as training objective discrepancies and serving efficiency bottlenecks. This paper introduces PinCLIP, a large-scale visual representation learning approach developed to enhance retrieval and ranking models at Pinterest by leveraging VLMs to learn image-text alignment. We propose a novel hybrid Vision Transformer architecture that utilizes a VLM backbone and a hybrid fusion mechanism to capture multi-modal content representations at varying granularities. Beyond standard image-to-text alignment objectives, we introduce a neighbor alignment objective to model the cross-fusion of multi-modal representations within the Pinterest Pin-Board graph. Offline evaluations show that PinCLIP outperforms state-of-the-art baselines, such as Qwen, by 20% in multi-modal retrieval tasks. Online A/B testing demonstrates significant business impact, including substantial engagement gains across all major surfaces in Pinterest. Notably, PinCLIP substantially mitigates the "cold-start" problem, improving fresh content distribution with a 15% increase in Repins for organic content and 8.7% higher clicks for new Ads.
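To make the two alignment objectives more concrete, here is a minimal NumPy sketch of a symmetric contrastive loss applied once to image-text pairs and once to Pin-neighbor pairs. It is an illustration under stated assumptions, not the paper's implementation: the InfoNCE form, the `temperature` value, the `neighbor_weight` factor, and all function and variable names are hypothetical, and random vectors stand in for real encoder outputs.

```python
import numpy as np

def l2_normalize(v):
    # Scale each row to unit length so dot products become cosine similarities.
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(a, b, temperature=0.07):
    # Row i of `a` and row i of `b` form a positive pair; every other row
    # in the batch acts as a negative. The temperature is an assumed value.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)

# Random stand-ins for encoder outputs (hypothetical shapes: batch 8, dim 32).
rng = np.random.default_rng(0)
image_emb = rng.standard_normal((8, 32))
text_emb = rng.standard_normal((8, 32))
neighbor_emb = rng.standard_normal((8, 32))  # embeddings of Pins sharing a Board

loss_image_text = symmetric_contrastive_loss(image_emb, text_emb)
loss_neighbor = symmetric_contrastive_loss(image_emb, neighbor_emb)

neighbor_weight = 1.0  # hypothetical weighting between the two objectives
total_loss = loss_image_text + neighbor_weight * loss_neighbor

# Sanity check: perfectly aligned pairs score a lower loss than random pairs.
loss_aligned = symmetric_contrastive_loss(image_emb, image_emb)
```

In this sketch both objectives share the same loss function and differ only in what counts as a positive pair: the text attached to a Pin, or another Pin from the Pin-Board graph.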

Paper Structure

This paper contains 27 sections, 12 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Overview of the PinCLIP fusion model architecture, highlighting the design of the Hybrid Vision Transformer backbone and the multimodal contrastive objectives for jointly learning image-text alignment and Pin-to-Pin alignment.
  • Figure 2: Illustration of the text-image dataset. Each image ("Pin") is associated with multiple text signals. We use longer-form descriptive text signals (title, description, image caption) and shorter-form keyword text signals (navboost search query, annotations).
  • Figure 3: Global affine embedding quantization. $x$ is the original full-precision embedding, $x_q$ is the quantized (lower-precision) embedding, and $x_{dq}$ is the reconstructed embedding used in downstream tasks. The scalars $s$ and $z$ are the scale and zero-point parameters, respectively.
  • Figure 4: Retrieval results (Recall@K) for the key evaluation tasks of PinText, Related Pins, and Search.
  • Figure 5: Assessment of the impact of dataset scaling on PinText image-to-text retrieval performance.
  • ...and 2 more figures
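
The global affine quantization named in the Figure 3 caption can be sketched as follows. This is a minimal illustration, assuming unsigned 8-bit storage and min/max calibration over the whole embedding matrix; neither choice is stated in the caption, and the function names are hypothetical. The symbols `x`, `x_q`, `x_dq`, `s`, and `z` follow the caption.

```python
import numpy as np

QMIN, QMAX = 0, 255  # assumed unsigned 8-bit range

def fit_global_affine(x):
    # One scale and one zero-point shared by every dimension of every
    # embedding ("global"), calibrated here from the overall min and max.
    x_min, x_max = float(x.min()), float(x.max())
    s = (x_max - x_min) / (QMAX - QMIN)
    if s == 0.0:
        s = 1.0  # degenerate case: all values identical
    z = int(round(QMIN - x_min / s))
    z = max(QMIN, min(QMAX, z))
    return s, z

def quantize(x, s, z):
    # x_q = clip(round(x / s) + z, QMIN, QMAX), stored as uint8.
    x_q = np.clip(np.round(x / s) + z, QMIN, QMAX)
    return x_q.astype(np.uint8)

def dequantize(x_q, s, z):
    # x_dq = (x_q - z) * s, the reconstruction used downstream.
    return (x_q.astype(np.float32) - z) * s

# Random stand-in for full-precision embeddings (hypothetical shape).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)

s, z = fit_global_affine(x)
x_q = quantize(x, s, z)
x_dq = dequantize(x_q, s, z)
max_error = float(np.max(np.abs(x_dq - x)))
```

Under these assumptions each value is stored in one byte instead of four, and the reconstruction error per value stays within one quantization step `s`.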