No-Human in the Loop: Agentic Evaluation at Scale for Recommendation

Tao Zhang, Kehui Yao, Luyi Ma, Jiao Chen, Reza Yousefi Maragheh, Kai Zhao, Jianpeng Xu, Evren Korpeoglu, Sushant Kumar, Kannan Achan

TL;DR

The paper asks whether large language models can serve as scalable judges for evaluating recommender systems, focusing on Complementary-Item Recommendation (CIR). It introduces ScalingEval, a multi-agent, no-human-in-the-loop framework that decomposes evaluation into CI pattern audits and issue audits, then synthesizes ground-truth labels by majority voting across 36 LLMs drawn from several model families (GPT, Gemini, Claude, Llama, and others). Four findings stand out: Claude 3.5 Sonnet achieves the highest decision confidence, Gemini 1.5 Pro has the best overall performance, GPT-4o offers the best latency-accuracy-cost trade-off, and GPT-OSS 20B leads among open-source models. At the category level, the judges agree strongly in Electronics and Sports but disagree more in Clothing and Food. The work provides a reproducible protocol for LLM-based evaluation, with actionable guidance on scaling evaluation pipelines and on model-family trade-offs for large-scale CIR and beyond.

Abstract

Evaluating large language models (LLMs) as judges is increasingly critical for building scalable and trustworthy evaluation pipelines. We present ScalingEval, a large-scale benchmarking study that systematically compares 36 LLMs, including GPT, Gemini, Claude, and Llama, across multiple product categories using a consensus-driven evaluation protocol. Our multi-agent framework aggregates pattern audits and issue codes into ground-truth labels via scalable majority voting, enabling reproducible comparison of LLM evaluators without human annotation. Applied to large-scale complementary-item recommendation, the benchmark reports four key findings: (i) Anthropic Claude 3.5 Sonnet achieves the highest decision confidence; (ii) Gemini 1.5 Pro offers the best overall performance across categories; (iii) GPT-4o provides the most favorable latency-accuracy-cost tradeoff; and (iv) GPT-OSS 20B leads among open-source models. Category-level analysis shows strong consensus in structured domains (Electronics, Sports) but persistent disagreement in lifestyle categories (Clothing, Food). These results establish ScalingEval as a reproducible benchmark and evaluation protocol for LLMs as judges, with actionable guidance on scaling, reliability, and model family tradeoffs.
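The consensus step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the label schema, the tie-handling rule, or the agreement metric, so the "pass"/"fail" labels, the choice to leave tied pairs unlabeled, and every name below (majority_label, consensus_labels, agreement_rate, judge_a, and so on) are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, List, Optional


def majority_label(votes: List[str]) -> Optional[str]:
    """Return the most common label, or None if first place is tied.

    Assumption: tied pairs get no consensus label. The source does not
    say how ties are resolved.
    """
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def consensus_labels(
    judgments: Dict[str, Dict[str, str]]
) -> Dict[str, Optional[str]]:
    """Build a consensus label per item pair.

    judgments maps a judge name to {pair_id: label}. Judges that did
    not score a pair simply do not vote on it.
    """
    pair_ids = sorted({p for labels in judgments.values() for p in labels})
    return {
        p: majority_label(
            [labels[p] for labels in judgments.values() if p in labels]
        )
        for p in pair_ids
    }


def agreement_rate(
    model_labels: Dict[str, str], consensus: Dict[str, Optional[str]]
) -> float:
    """Fraction of a judge's labels that match the consensus label.

    Pairs without a consensus label (ties) are skipped.
    """
    scored = [p for p in model_labels if consensus.get(p) is not None]
    if not scored:
        return 0.0
    hits = sum(model_labels[p] == consensus[p] for p in scored)
    return hits / len(scored)


# Hypothetical judgments from three judges on three item pairs.
judgments = {
    "judge_a": {"p1": "pass", "p2": "fail", "p3": "pass"},
    "judge_b": {"p1": "pass", "p2": "fail", "p3": "fail"},
    "judge_c": {"p1": "pass", "p2": "pass", "p3": "fail"},
}

consensus = consensus_labels(judgments)
rates = {name: agreement_rate(lbls, consensus) for name, lbls in judgments.items()}
print(consensus)
print(rates)
```

In this toy data, judge_b matches the consensus on every pair, while the other two judges each disagree once. Computing the same rate separately for each product category would give a per-category view of judge consensus like the one the abstract reports.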

Paper Structure

This paper contains 11 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: LLM-Agentic Evaluation at Scale without human in the loop
  • Figure 2: Overview of the ScalingEval framework.
  • Figure 3: Agreement rate by category, plotted against pair frequency and cumulative fraction.
  • Figure 4: Examples of item pairs from Clothing, Food, Toys, and Pet Supplies, with the evaluation responses that different LLMs produced for them.