Peeking Behind Closed Doors: Risks of LLM Evaluation by Private Data Curators

Hritik Bansal, Pratyush Maini

TL;DR

The paper argues that private, curator-run LLM evaluations can introduce financial conflicts and subjective annotator biases that undermine evaluation reliability. It formalizes this concern by simulating a controlled setting with two evaluators and two models to quantify preference and ranking biases. Results show measurable self-bias (approximately 10.5%–12.9%) and ELO-based ranking shifts depending on the evaluator, signaling that private evaluations may reflect evaluator quirks rather than true general performance. The work advocates independent, transparent evaluation practices and disclosure of potential conflicts to ensure credible assessments in the LLM ecosystem.

Abstract

The rapid advancement in building large language models (LLMs) has intensified competition among big-tech companies and AI startups. In this regard, model evaluations are critical for product and investment-related decision-making. While open evaluation sets like MMLU initially drove progress, concerns around data contamination and data bias have repeatedly called their reliability into question. This has led to the rise of private data curators, who have begun conducting hidden evaluations with high-quality self-curated test prompts and their own expert annotators. In this paper, we argue that despite potential advantages in addressing contamination issues, private evaluations introduce inadvertent financial and evaluation risks. In particular, the key concerns include the potential conflict of interest arising from private data curators' business relationships with their clients (leading LLM firms). In addition, we highlight that the subjective preferences of private expert annotators will lead to inherent evaluation bias towards the models trained with the private curators' data. Overall, this paper lays the foundation for studying the risks of private evaluations that can lead to wide-ranging community discussions and policy changes.

Paper Structure

This paper contains 15 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Illustration of private evaluation pipeline showing the relationship between data curators, annotators, and model developers. The dual role of data curators in providing training data and evaluating models can introduce biases. NOTE: The leaderboard is only for representation purposes. We don't see this effect in the current SEAL rankings.
  • Figure 2: Screenshot showing OpenAI and Cohere as customers of ScaleAI, while ScaleAI also evaluates their models.
  • Figure 3: An ELO difference of 44 on the LMSys leaderboard allows for significant bragging rights as demonstrated by public discussions of even smaller ELO differences of 5-20 points.
  • Figure 4: Demonstration of a simple adversarial strategy that can allow a user to maliciously vote for a model of their choice on LMSys Arena. One can simply ask the model to reveal its identity before answering.
  • Figure 5: The North Cascades view that inspired our discussions on LLM evaluation bias.
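
To put the Elo gaps mentioned in Figure 3 in perspective, the standard Elo expected-score formula converts a rating difference into an expected head-to-head win rate. The following is a minimal sketch, assuming the conventional base-10, 400-point Elo scale; the leaderboard's exact fitting procedure is not described here, so this illustrates the general formula rather than its method. The function name `elo_expected_score` is hypothetical.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated model under the standard Elo formula.

    Assumption: base-10 logistic curve with a 400-point scale, as in classical Elo.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Gaps taken from the Figure 3 caption: 5-20 points and 44 points.
for gap in (5, 20, 44):
    print(f"Elo gap {gap:>2}: expected win rate {elo_expected_score(gap):.3f}")
```

Under this assumption, a 44-point gap corresponds to an expected win rate of about 56%, and gaps of 5 to 20 points to roughly 51% to 53%, so the differences being discussed publicly are small in head-to-head terms.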