The AI Consumer Index (ACE)

Julien Benchek, Rohit Shetty, Benjamin Hunsberger, Ajay Arun, Zach Richards, Brendan Foody, Osvald Nitski, Bertie Vidgen

TL;DR

ACE-v1 introduces a consumer-focused benchmark with a heldout 400-task set and an open 80-task dev set, assessed with a grounding-based, hurdle-driven rubric across four domains (Shopping, DIY, Gaming, Food). The evaluation uses web-grounding verification and multi-run grading to quantify how faithfully frontier AI systems ground claims and meet user objectives. Key findings show strong cross-domain variation and a persistent gap between top performers and everyday consumer needs, particularly in grounding and price-related claims. The paper also discusses limitations, contamination risks, and directions for expanding domain coverage, modalities, and dynamic evaluation on the changing internet.

Abstract

We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform everyday consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open-sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with web search turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) at 55.2% and GPT 5.1 (Thinking = High) at 55.1%. Model scores differ across domains, and in Shopping the top model scores under 50%. We find that models are prone to hallucinating key information, such as prices. ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.

Paper Structure

This paper contains 24 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: The ACE leaderboard (ACE-v1-heldout).
  • Figure 2: Example rubric for Shopping (ID 676) with 9 criteria. This case is from ACE-v1-dev and is not used in the ACE leaderboard.
  • Figure 3: Overview of the production process for creating cases in the AI Consumer Index. Quality control is applied at every step.
  • Figure 4: Hierarchical process for grading criteria in ACE-v1.
  • Figure 5: The net difference in pass rates, comparing grounded criteria with all criteria. Negative scores indicate that models are relatively worse at grounding their responses than at meeting the prompt's requirements.
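
The quantity described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes the net difference is the pass rate on grounded criteria minus the pass rate on all criteria; the exact calculation is not stated in this section. The `CriterionResult` record and the function names `pass_rate` and `grounding_gap` are hypothetical and are not taken from the ACE codebase.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CriterionResult:
    """One graded rubric criterion (hypothetical record layout)."""
    passed: bool
    is_grounding: bool  # True if the criterion checks web grounding


def pass_rate(results: List[CriterionResult]) -> Optional[float]:
    """Fraction of criteria passed; None when there is nothing to score."""
    if not results:
        return None
    return sum(r.passed for r in results) / len(results)


def grounding_gap(results: List[CriterionResult]) -> Optional[float]:
    """Assumed definition: grounded pass rate minus overall pass rate."""
    grounded = [r for r in results if r.is_grounding]
    grounded_rate = pass_rate(grounded)
    overall_rate = pass_rate(results)
    if grounded_rate is None or overall_rate is None:
        return None
    return grounded_rate - overall_rate
```

Under this assumed definition, a model that passes most criteria but fails the grounding ones gets a negative gap, matching the reading of negative scores in the caption.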