Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning

Sky CH-Wang, Darshan Deshpande, Smaranda Muresan, Anand Kannappan, Rebecca Qian

TL;DR

BLUR presents a demanding tip-of-the-tongue known-item benchmark with 573 multimodal and multilingual questions to probe general AI assistants' ability to perform multi-hop reasoning and tool use. The dataset enforces unambiguous ground truths, uses a two-stage prompt design and private test sets to ensure fair evaluation, and benchmarks a wide range of systems against human validators with automated LLM judgments. Key findings show humans substantially outperform current agents, with tool use offering only marginal gains and significant domain-dependent variation, especially for place-related queries. The work provides a public developer subset and a scalable evaluation framework that emphasizes real-world information needs, aiming to advance robust, tool-driven reasoning in general AI systems.

Abstract

We introduce Browsing Lost Unformed Recollections (BLUR), a tip-of-the-tongue known-item search and reasoning benchmark for general AI assistants. BLUR comprises 573 real-world validated questions that demand searching and reasoning across multimodal and multilingual inputs, as well as proficient tool use. Humans easily ace these questions (scoring 98% on average), while the best-performing system scores around 56%. To facilitate progress on this challenging and aspirational use case for general AI assistants, we release 350 questions through a public leaderboard, retain the answers to 250 of them, and keep the rest as a private test set.

Paper Structure

This paper contains 32 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: A sample text-only query and its corresponding answer from our BLUR dataset. An example with multimodal input is shown in Figure 4.
  • Figure 2: Distribution of prompts across topical domains, file extensions for prompts with attached files, and non-English-language-only prompt languages. Note that 75% of prompts are text-only without attached files, and 70% are exclusively in English.
  • Figure 3: Distribution of response times taken by validators to answer questions (top) and the breakdown of query difficulty levels (bottom) as characterized by splits based on those times.
  • Figure 4: Evaluation instance example of a medium-difficulty query with a file input. The prompt scaffold is shown, along with the output provided by ChatGPT 4o and the LLM Judge weak string match result. The step-by-step validation chain that a human validator took to answer this query is shown in Figure 5. Code execution in the output is highlighted in gray, while web search is highlighted in orange; both are truncated for simplicity.
  • Figure 5: Steps that a human validator took to answer the query shown in Figure 4. The validator took fifteen minutes and forty-five seconds to arrive at the answer.
  • ...and 1 more figure