The AI Consumer Index (ACE)
Julien Benchek, Rohit Shetty, Benjamin Hunsberger, Ajay Arun, Zach Richards, Brendan Foody, Osvald Nitski, Bertie Vidgen
TL;DR
ACE-v1 introduces a consumer-focused benchmark with a held-out 400-task set and an open 80-task dev set, assessed with a grounding-based, hurdle-driven rubric across four domains (Shopping, DIY, Gaming, Food). The evaluation uses web-grounding verification and multi-run grading to quantify how faithfully frontier AI systems ground their claims and meet user objectives. Key findings show strong cross-domain variation and a persistent gap between what even the top-performing models deliver and what everyday consumers need, particularly in grounding and price-related claims. The paper also discusses limitations, contamination risks, and directions for expanding domain coverage, modalities, and dynamic evaluation on the changing internet.
Abstract
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform everyday consumer tasks. ACE contains a hidden held-out set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open-sourcing 80 cases as a dev set under a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with web search turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) at 55.2% and GPT 5.1 (Thinking = High) at 55.1%. Model scores differ across domains, and in Shopping the top model scores under 50%. We find that models are prone to hallucinating key information, such as prices. ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
