SAGE: Scalable AI Governance & Evaluation

Benjamin Le, Xueying Lu, Nick Stern, Wenqiong Liu, Igor Lapchuk, Xiang Li, Baofen Zheng, Kevin Rosenberg, Jiewen Huang, Zhe Zhang, Abraham Cabangbang, Satej Milind Wagle, Jianqiang Shen, Raghavan Muthuregunathan, Abhinav Gupta, Mathew Teoh, Andrew Kirk, Thomas Kwan, Jingwei Wu, Wenjing Zhang

TL;DR

SAGE addresses the governance gap in industrial-scale AI search by turning human product judgment into a scalable evaluation signal built from three components: Policy, Precedent, and an LLM Surrogate Judge. In a bidirectional calibration loop, the three co-evolve to resolve ambiguities, bringing the judge to near human-level agreement. The calibrated judge is then distilled into a Student Judge that can sustain more than 10^7 offline annotations per day and more than 10^4 QPS online. Distillation preserves judgment quality, reaching substantial agreement (Student–Human κ ≥ 0.7) at about 92× lower cost than the teacher. Deployed in LinkedIn Search, SAGE enabled simulation-driven development, offline candidate screening, and policy oversight that detected regressions invisible to engagement metrics, together driving a 0.25% lift in daily active users. The work shows that structured, versioned, and decomposed governance can scale to industrial AI systems while improving reliability, interpretability, and user outcomes.

Abstract

Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained human oversight and the high-throughput requirements of production systems. While traditional approaches rely on engagement proxies or sparse manual review, these methods often fail to capture the full scope of high-impact relevance failures. We present SAGE (Scalable AI Governance & Evaluation), a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. At the core of SAGE is a bidirectional calibration loop where natural-language Policy, curated Precedent, and an LLM Surrogate Judge co-evolve. SAGE systematically resolves semantic ambiguities and misalignments, transforming subjective relevance judgment into an executable, multi-dimensional rubric with near human-level agreement. To bridge the gap between frontier model reasoning and industrial-scale inference, we apply teacher-student distillation to transfer high-fidelity judgments into compact student surrogates at 92× lower cost. Deployed within LinkedIn Search ecosystems, SAGE guided model iteration through simulation-driven development, distilling policy-aligned models for online serving and enabling rapid offline evaluation. In production, it powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics. Collectively, these drove a 0.25% lift in LinkedIn daily active users.

Paper Structure (19 sections, 3 equations, 5 figures, 4 tables)


Figures (5)

  • Figure 1: SAGE framework: Bidirectional Calibration produces a calibrated Teacher Judge, which distills into a scalable Student Judge that enables Simulation-Driven Development and Production Oversight.
  • Figure 2: Linear Cohen's kappa of Student Judge across major training iterations for Job Search and People Search.
  • Figure 3: Linear Cohen's kappa scores across different judge comparisons.
  • Figure 4: Our distillation cascade: a frontier Teacher Judge is distilled into a scalable Student Judge, which is further compressed into an Online Model for production ranking.
  • Figure 5: PMR@10 vs. Dismiss to Apply Ratio (D2A) in the presence of a production incident degrading search quality.
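The agreement figures above are reported as linear Cohen's kappa. As a reading aid, the sketch below shows one common way to compute a linearly weighted kappa between two sets of ordinal labels. It is an illustration under stated assumptions, not the paper's evaluation code: the function name `linear_weighted_kappa`, the integer label encoding 0..num_levels-1, and the 4-level scale in the usage example are all hypothetical.

```python
from collections import Counter


def linear_weighted_kappa(rater_a, rater_b, num_levels):
    """Linearly weighted Cohen's kappa for ordinal labels in 0..num_levels-1.

    Hypothetical helper: the label encoding is an assumption of this sketch.
    """
    if num_levels < 2:
        raise ValueError("need at least two label levels")
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(rater_a)
    max_dist = num_levels - 1

    # Observed disagreement: mean distance between paired labels,
    # normalized so that the largest possible gap counts as 1.
    observed = sum(abs(a - b) for a, b in zip(rater_a, rater_b)) / (n * max_dist)

    # Expected disagreement if the two raters labeled independently,
    # computed from each rater's marginal label counts.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        count_a[i] * count_b[j] * abs(i - j)
        for i in range(num_levels)
        for j in range(num_levels)
    ) / (n * n * max_dist)

    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return 1.0 - observed / expected


# Usage with made-up labels on an assumed 4-level relevance scale (0..3).
human = [0, 1, 2, 3, 3, 2, 1, 0]
judge = [0, 1, 2, 3, 2, 2, 1, 1]
print(round(linear_weighted_kappa(human, judge, num_levels=4), 3))
```

With linear weights, a one-level disagreement is penalized less than a disagreement across the whole scale, which suits graded relevance labels better than unweighted kappa. A value of 1 means perfect agreement and 0 means agreement no better than chance.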