Level-Navi Agent: A Framework and benchmark for Chinese Web Search Agents

Chuanrui Hu, Shichong Xie, Baoxin Wang, Bin Chen, Xiaofeng Cong, Jun Zhang

TL;DR

This work tackles the challenge of evaluating Chinese web search with LLM-driven agents by introducing Level-Navi Agent, a training-free framework that hierarchically plans and retrieves information across internet levels. It couples this with Web24, a richly annotated Chinese web-search benchmark, and a tailored four-metric evaluation scheme, including a final score $S_{final}$. Empirical results show that larger open-source models like Qwen2.5-72B and Deepseek-V2.5 perform best on Web24, while revealing issues such as overconfidence in calling web-search functions and low task fidelity on Chinese tasks. The study highlights the need for multilingual optimization and better information-source filtering to advance practical Chinese AI web search, and provides datasets and metrics to facilitate fair, training-free assessments across models.

Abstract

Large language models (LLMs), adopted to understand human language, drive the development of artificial intelligence (AI) web search agents. Compared to traditional search engines, LLM-powered AI search agents are capable of understanding and responding to complex queries with greater depth, enabling more accurate operations and better context recognition. However, little attention has been paid to Chinese web search, and as a result the capabilities of open-source models have not been uniformly and fairly evaluated. The difficulty lies in three missing pieces: a unified agent framework, an accurately labeled dataset, and a suitable evaluation metric. To address these issues, we propose Level-Navi Agent, a general-purpose, training-free web search agent based on level-aware navigation, accompanied by a well-annotated dataset (Web24) and a suitable evaluation metric. Level-Navi Agent can think through complex user questions and conduct searches across various levels of the internet to gather the information needed to answer them. Meanwhile, we provide a comprehensive evaluation of state-of-the-art LLMs under fair settings. To further facilitate future research, source code is available on GitHub.

Paper Structure

This paper contains 15 sections, 1 equation, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Pipeline of Our Level-Navi Agent.
  • Figure 2: The framework of Level-Navi Agent.
  • Figure 3: Our Level-Navi Agent demonstrates an example of handling a user query. The Planning Agent first requests the collection of information about nominated games. After receiving feedback, it then proceeds to search for the release date of each game in parallel (we translate the process from Chinese to English for better understanding).
  • Figure 4: Source, domain and type of Web24 Dataset.
  • Figure 5: Comparison with other products based on our metrics.
  • ...and 1 more figure
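
The flow described in the Figure 3 caption (a first search that collects a list of items, then one follow-up search per item, run in parallel) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `FAKE_INDEX`, `fake_search`, `plan_and_search`, and the placeholder game names and dates are all hypothetical. A real agent would use an LLM to plan the queries and a live search tool to answer them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical stand-in for a web search backend (placeholder data only).
FAKE_INDEX: Dict[str, str] = {
    "nominated games": "Game A; Game B; Game C",
    "release date of Game A": "date-A",
    "release date of Game B": "date-B",
    "release date of Game C": "date-C",
}


def fake_search(query: str) -> str:
    """Return a canned answer for a query, or a fallback string."""
    return FAKE_INDEX.get(query, "no result")


def plan_and_search(search: Callable[[str], str]) -> Dict[str, str]:
    """Two-level search: collect the item list first, then query each item in parallel."""
    # Level 1: a single query gathers the list the later queries depend on.
    listing = search("nominated games")
    games = [name.strip() for name in listing.split(";") if name.strip()]

    # Level 2: the per-item queries are independent, so they can run concurrently.
    queries = [f"release date of {name}" for name in games]
    with ThreadPoolExecutor(max_workers=4) as pool:
        dates = list(pool.map(search, queries))  # map() preserves input order

    return dict(zip(games, dates))


result = plan_and_search(fake_search)
print(result)
```

The second level only starts once the first has returned, because its queries are built from the first level's answer. Within a level, the queries do not depend on each other, which is what makes parallel execution safe in this sketch.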