Needle in the Web: A Benchmark for Retrieving Targeted Web Pages in the Wild

Yumeng Wang, Tianyu Fan, Lingrui Xu, Chao Huang

TL;DR

Needle in the Web is a benchmark for evaluating LLM-based search agents on fuzzy, exploratory web queries. It comprises 663 multi-domain queries with tunable difficulty, and its automated, LLM-judged evaluation accepts only a single webpage that satisfies all of a query's implicit criteria. In experiments across three closed-source and three open-source agents, accuracy generally falls below 35% and varies by domain and difficulty, revealing substantial gaps in current retrieval and tool-use capabilities. The work highlights the need for uncertainty-aware, semantically robust, and verifiable retrieval systems, and offers a modular framework for extending the benchmark across languages and modalities.

Abstract

Large Language Models (LLMs) have evolved from simple chatbots into sophisticated agents capable of automating complex real-world tasks, where browsing and reasoning over live web content is key to assessing retrieval and cognitive skills. Existing benchmarks like BrowseComp and xBench-DeepSearch emphasize complex reasoning searches requiring multi-hop synthesis but neglect Fuzzy Exploratory Search, namely queries that are vague and multifaceted, where users seek the most relevant webpage rather than a single factual answer. To address this gap, we introduce Needle in the Web, a novel benchmark specifically designed to evaluate modern search agents and LLM-based systems on their ability to retrieve and reason over real-world web content in response to ambiguous, exploratory queries under varying levels of difficulty. Needle in the Web comprises 663 questions spanning seven distinct domains. To ensure high query quality and answer uniqueness, we employ a flexible methodology that reliably generates queries of controllable difficulty from factual claims in web content. We benchmark three leading LLMs and three agent-based search systems on Needle in the Web, finding that most models struggle: many achieve below 35% accuracy, and none consistently excels across domains or difficulty levels. These findings show that Needle in the Web presents a significant challenge for current search systems and highlight the open problem of effective fuzzy retrieval under semantic ambiguity.

Paper Structure

This paper contains 26 sections, 3 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: An overview of model performance on Needle in the Web. Items on the x-axis denote the source websites from which queries are collected.
  • Figure 2: A comparison between Complex Reasoning Search and Fuzzy Exploratory Search. Complex Reasoning Search follows a clear strategy and involves only factoid information. Fuzzy Exploratory Search, by contrast, must deal with multi-faceted queries: it needs to identify the query's implicit requirements and find the most appropriate source.
  • Figure 3: A sample query of Needle in the Web. Each of the separate requirements may be satisfied by multiple webpages, yet only the webpage that meets all requirements is considered the correct answer.
  • Figure 4: An illustration of our automated query collection pipeline. Different selected claims undergo the same processing; they differ only in the difficulty of the final query.
  • Figure 5: A real example illustrating the typical errors that agents exhibit. Due to space limits, some content is abbreviated with ellipses.

Theorems & Definitions (3)

  • Definition 1: Masked Criterion
  • Definition 2: Semantic Mention
  • Definition 3: Query Satisfaction