Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning
Sky CH-Wang, Darshan Deshpande, Smaranda Muresan, Anand Kannappan, Rebecca Qian
TL;DR
BLUR presents a demanding tip-of-the-tongue known-item benchmark with 573 multimodal and multilingual questions to probe general AI assistants' ability to perform multi-hop reasoning and tool use. The dataset enforces unambiguous ground truths, uses a two-stage prompt design and private test sets to ensure fair evaluation, and benchmarks a wide range of systems against human validators with automated LLM judgments. Key findings show humans substantially outperform current agents, with tool use offering only marginal gains and significant domain-dependent variation, especially for place-related queries. The work provides a public developer subset and a scalable evaluation framework that emphasizes real-world information needs, aiming to advance robust, tool-driven reasoning in general AI systems.
Abstract
We introduce Browsing Lost Unformed Recollections (BLUR), a tip-of-the-tongue known-item search and reasoning benchmark for general AI assistants. BLUR comprises 573 real-world validated questions that demand searching and reasoning across multimodal and multilingual inputs, as well as proficient tool use. Humans easily ace these questions (scoring on average 98%), while the best-performing system scores around 56%. To facilitate progress on this challenging and aspirational use case for general AI assistants, we release 350 questions through a public leaderboard, retain the answers to 250 of them, and hold out the rest as a private test set.
