LLM-based Relevance Assessment for Web-Scale Search Evaluation at Pinterest

Han Wang, Alex Whitworth, Pak Ming Cheung, Zhenjie Zhang, Krishna Kamath

TL;DR

The paper tackles scalable relevance evaluation for online search experiments by deploying fine-tuned cross-encoder LLMs to predict query–Pin relevance on a 5-point scale. It demonstrates strong alignment with human judgments (Kendall’s $\tau > 0.5$ and Spearman’s $\rho > 0.65$) and shows substantial efficiency gains, reducing the Minimum Detectable Effect (MDE) to $\le 0.25\%$ through stratified sampling and Neyman allocation. The approach enables expanding the query set, refining the sampling design, and assessing a wider range of search experiences at Pinterest with lower labeling costs. The work also validates multilingual performance, albeit with some degradation in non-English languages, and points to future work on Visual Language Models and broader multilingual support to further improve online relevance metrics at scale.
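To make the sampling idea concrete, the sketch below shows textbook Neyman allocation, which assigns each stratum a sample share proportional to its size times its standard deviation, and compares the resulting standard error with proportional allocation. It is an illustration under stated assumptions, not the paper's implementation: the strata sizes, standard deviations, sample budget, and function names are all hypothetical, and the MDE helper uses the common normal approximation with a 5% two-sided significance level and 80% power.

```python
import math

def neyman_allocation(total_n, strata_sizes, strata_stds):
    """Split total_n samples across strata in proportion to N_h * S_h."""
    weights = [size * std for size, std in zip(strata_sizes, strata_stds)]
    total = sum(weights)
    return [total_n * w / total for w in weights]

def proportional_allocation(total_n, strata_sizes):
    """Split total_n samples across strata in proportion to N_h only."""
    population = sum(strata_sizes)
    return [total_n * size / population for size in strata_sizes]

def stratified_se(strata_sizes, strata_stds, allocation):
    """Standard error of the stratified mean (no finite-population correction)."""
    population = sum(strata_sizes)
    variance = sum(
        (size / population) ** 2 * std ** 2 / n
        for size, std, n in zip(strata_sizes, strata_stds, allocation)
    )
    return math.sqrt(variance)

def minimum_detectable_effect(se_diff, z_alpha=1.96, z_beta=0.84):
    """Absolute MDE under a normal approximation.

    se_diff is the standard error of the estimated treatment-control difference.
    The default z-values assume a two-sided 5% test with 80% power.
    """
    return (z_alpha + z_beta) * se_diff

# Hypothetical query strata (e.g., grouped by popularity); values are made up.
sizes = [1000, 3000, 6000]
stds = [0.30, 0.20, 0.10]
budget = 1000

neyman = neyman_allocation(budget, sizes, stds)
proportional = proportional_allocation(budget, sizes)

se_neyman = stratified_se(sizes, stds, neyman)
se_proportional = stratified_se(sizes, stds, proportional)

print("Neyman allocation:", neyman)
print("SE (Neyman):", se_neyman, "SE (proportional):", se_proportional)
```

With the same labeling budget, Neyman allocation shifts samples toward the high-variance strata, which lowers the standard error and therefore the MDE for a fixed significance level and power.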

Abstract

Relevance evaluation plays a crucial role in personalized search systems to ensure that search results align with a user's queries and intent. While human annotation is the traditional method for relevance evaluation, its high cost and long turnaround time limit its scalability. In this work, we present our approach at Pinterest Search to automate relevance evaluation for online experiments using fine-tuned LLMs. We rigorously validate the alignment between LLM-generated judgments and human annotations, demonstrating that LLMs can provide reliable relevance measurement for experiments while greatly improving the evaluation efficiency. Leveraging LLM-based labeling further unlocks the opportunities to expand the query set, optimize sampling design, and efficiently assess a wider range of search experiences at scale. This approach leads to higher-quality relevance metrics and significantly reduces the Minimum Detectable Effect (MDE) in online experiment measurements.

Paper Structure

This paper contains 15 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The cross-encoder architecture in the relevance teacher model, illustrated with encoder language models (e.g., BERT-based models).
  • Figure 2: Components of LLM-based relevance measurement at Pinterest Search.
  • Figure 3: Query-level sDCG@K error distribution for single group (left) and paired differences (right) in US market relevance evaluation.
  • Figure 4: Query-level sDCG@K error distribution for single group (left) and paired differences (right) in France (top) and Germany (bottom) markets relevance evaluation.