More than Marketing? On the Information Value of AI Benchmarks for Practitioners

Amelia Hardy, Anka Reuel, Kiana Jafari Meimandi, Lisa Soder, Allie Griffith, Dylan M. Asmar, Sanmi Koyejo, Michael S. Bernstein, Mykel J. Kochenderfer

TL;DR

This paper investigates how AI benchmarks influence practical decision-making and finds that practitioners mainly use benchmarks as signals of relative model performance rather than as absolute determinants for deployment. It draws on 19 semi-structured interviews across academia, policy, and industry, and uses the Unified Theory of Acceptance and Use of Technology (UTAUT), a model of IT adoption, to explain adoption gaps, particularly low performance expectancy caused by misalignment with real-world tasks. The authors argue for more informative benchmarks that involve domain experts, transparently define scope and goals, and include measures to prevent data contamination, while maintaining that human evaluation remains necessary. The work highlights that benchmarks can drive research progress but cannot substitute for real-world testing, especially in high-stakes or safety-critical contexts, and offers design principles to make benchmarks more decision-relevant. These findings have practical implications for benchmark developers and for organizations seeking reliable evaluation tools aligned with real-world deployment needs.

Abstract

Public AI benchmark results are widely broadcast by model developers as indicators of model quality within a growing and competitive market. However, these advertised scores do not necessarily reflect the traits of interest to those who will ultimately apply AI models. In this paper, we seek to understand if and how AI benchmarks are used to inform decision-making. Based on the analyses of interviews with 19 individuals who have used, or decided against using, benchmarks in their day-to-day work, we find that across these settings, participants use benchmarks as a signal of relative performance difference between models. However, whether this signal was considered a definitive sign of model superiority, sufficient for downstream decisions, varied. In academia, public benchmarks were generally viewed as suitable measures for capturing research progress. By contrast, in both product and policy, benchmarks -- even those developed internally for specific tasks -- were often found to be inadequate for informing substantive decisions. Of the benchmarks deemed unsatisfactory, respondents reported that their goals were neither well-defined nor reflective of real-world use. Based on the study results, we conclude that effective benchmarks should provide meaningful, real-world evaluations, incorporate domain expertise, and maintain transparency in scope and goals. They must capture diverse, task-relevant capabilities, be challenging enough to avoid quick saturation, and account for trade-offs in model performance rather than relying on a single score. Additionally, proprietary data collection and contamination prevention are critical for producing reliable and actionable results. By adhering to these criteria, benchmarks can move beyond mere marketing tricks into robust evaluative frameworks.

Paper Structure

This paper contains 28 sections and 1 table.