Decision Quality Evaluation Framework at Pinterest

Yuqi Tian, Robert Paine, Attila Dobi, Kevin O'Sullivan, Aravindh Manickavasagam, Faisal Farooq

TL;DR

This work presents a comprehensive Decision Quality Evaluation Framework developed and deployed at Pinterest, centered on a high-trust Golden Set curated by subject matter experts (SMEs), which serves as a ground truth benchmark.

Abstract

Online platforms require robust systems to enforce content safety policies at scale. A critical component of these systems is the ability to evaluate the quality of moderation decisions made by both human agents and Large Language Models (LLMs). However, this evaluation is challenging due to the inherent trade-offs between cost, scale, and trustworthiness, along with the complexity of evolving policies. To address this, we present a comprehensive Decision Quality Evaluation Framework developed and deployed at Pinterest. The framework is centered on a high-trust Golden Set (GDS) curated by subject matter experts (SMEs), which serves as a ground truth benchmark. We introduce an automated intelligent sampling pipeline that uses propensity scores to efficiently expand dataset coverage. We demonstrate the framework's practical application in several key areas: benchmarking the cost-performance trade-offs of various LLM agents, establishing a rigorous methodology for data-driven prompt optimization, managing complex policy evolution, and ensuring the integrity of policy content prevalence metrics via continuous validation. The framework enables a shift from subjective assessments to a data-driven and quantitative practice for managing content safety systems.

Paper Structure

This paper contains 11 sections, 3 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: The Hierarchy of Trustworthiness for Labeling Sources, illustrating the trade-off between label quality and scalability. Policy Experts are the authors and ultimate interpreters of the written policy. SMEs are highly trained specialists who produce the highest-quality labels for creating ground truth datasets like the GDS. Leads are experienced reviewers who oversee the quality of the scalable agent workforce at the base.
  • Figure 2: An illustration of design trade-offs. The GDS (left) is optimized for Trustworthiness and Coverage. In contrast, typical Scalable Datasets (right), such as those from production-scale agents, are optimized for Size and low Cost. The diagram is conceptual: the values and scales for Size, Cost, Trustworthiness, and Representativeness have not been quantified.
  • Figure 3: The automated evaluation framework, showing the three core workflows. The Update Workflow creates a new GDS version, which triggers the Metrics Workflow. The resulting coverage metric is fed back to inform future sampling, creating an intelligent loop.
  • Figure 4: A Sankey diagram visualizing the policy delta on the GDS, showing the flow of items from old SME labels to new SME labels after a policy update.
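
The flows in a Sankey diagram like the one described for Figure 4 are, in essence, counts of items grouped by (old label, new label) pairs. The following is a minimal sketch of that tabulation, not the authors' implementation: the function name `label_transitions`, the item IDs, and the label names are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def label_transitions(old_labels, new_labels):
    """Count items for each (old label, new label) pair.

    Both arguments map an item ID to its SME label. This is an
    illustrative helper, not part of the framework described above.
    """
    if old_labels.keys() != new_labels.keys():
        raise ValueError("both label sets must cover the same item IDs")
    return Counter((old_labels[item], new_labels[item]) for item in old_labels)


# Hypothetical item IDs and label names, for illustration only.
old = {"a": "violating", "b": "violating", "c": "safe", "d": "safe"}
new = {"a": "violating", "b": "safe", "c": "safe", "d": "safe"}

flows = label_transitions(old, new)

# Items whose label changed under the new policy (the "policy delta").
changed = sum(count for (src, dst), count in flows.items() if src != dst)
```

Each key in `flows` corresponds to one band in the diagram, and its count to the band's width. Bands where the old and new labels differ are the items affected by the policy update.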