O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering

Jianbiao Mei, Tao Hu, Daocheng Fu, Licheng Wen, Xuemeng Yang, Rong Wu, Pinlong Cai, Xinyu Cai, Xing Gao, Yu Yang, Chengjun Xie, Botian Shi, Yong Liu, Yu Qiao

TL;DR

The paper addresses a core limitation of large language models: they rely on static pretraining knowledge, which hampers open-domain question answering that needs up-to-date information. It introduces O$^2$-Searcher, a reinforcement learning–based search agent that operates in a locally simulated search environment to acquire external information, decoupling external knowledge from the model's internal reasoning and supporting both open-ended and closed-ended QA. A unified training framework with carefully designed reward functions (Format, Diversity, Factual) guides the agent to identify the problem type and adapt its answer strategy, while the O$^2$-QA benchmark provides a high-quality testbed for open-ended QA across multiple domains. Empirical results show that a 3B-parameter backbone can outperform state-of-the-art LLM agents on O$^2$-QA and achieve competitive or state-of-the-art performance on various closed-ended benchmarks, highlighting the practicality and scalability of search-guided reasoning for robust open-domain QA.

Abstract

Large Language Models (LLMs), despite their advancements, are fundamentally limited by their static parametric knowledge, hindering performance on tasks requiring open-domain up-to-date information. While enabling LLMs to interact with external knowledge environments is a promising solution, current efforts primarily address closed-ended problems. Open-ended questions, which are characterized by lacking a standard answer or by admitting non-unique and diverse answers, remain underexplored. To bridge this gap, we present O$^2$-Searcher, a novel search agent leveraging reinforcement learning to effectively tackle both open-ended and closed-ended questions in the open domain. O$^2$-Searcher leverages an efficient, locally simulated search environment for dynamic knowledge acquisition, effectively decoupling external world knowledge from the model's sophisticated reasoning processes. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adapt its answer generation strategy accordingly. Furthermore, to evaluate performance on complex open-ended tasks, we construct O$^2$-QA, a high-quality benchmark featuring 300 manually curated, multi-domain open-ended questions with associated web page caches. Extensive experiments show that O$^2$-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O$^2$-QA. It also achieves SOTA results on various closed-ended QA benchmarks against similarly-sized models, while performing on par with much larger ones.

Paper Structure

This paper contains 24 sections, 11 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Illustration of different characteristics of closed-ended and open-ended questions.
  • Figure 2: The construction of the knowledge corpus for open-ended questions.
  • Figure 3: We use multi-round conversations to model action trajectories and enhance interactivity. The agent reasons in <think> tags, searches via <search> (with specific queries in <query>), and answers in <answer> when ready. The search content returned by the search environment is presented in <learnings>. For open-ended problems, multiple potential queries are generated.
  • Figure 4: GRPO training through interaction with the search environment. The policy model is optimized with GRPO while interacting with the local search environment, using a reference model and rollout outputs from the preceding policy model. For closed-ended questions, optimization is driven by a Factual reward. For open-ended questions targeting key findings, training is guided by a composite reward signal consisting of Format, Diversity, and Factual rewards.
  • Figure 5: The evolution of response length, reward value, and valid search results across training steps. Incorporating open-ended data yields superior training stability, longer average response lengths, and more search turns on average during training.
  • ...and 5 more figures
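
The tag-based turn format described in the Figure 3 caption can be illustrated with a small parser. This is only a minimal sketch, not the authors' implementation: the tag names (<think>, <search>, <query>, <answer>, <learnings>) come from the caption, while the nesting of <query> inside <search>, the `AgentTurn` container, and the helper names `parse_turn` and `wrap_learnings` are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgentTurn:
    """Hypothetical container for one parsed agent turn."""
    thought: Optional[str] = None          # content of <think>, if present
    queries: List[str] = field(default_factory=list)  # each <query> inside <search>
    answer: Optional[str] = None           # content of <answer>, if present


def _first(tag: str, text: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_turn(text: str) -> AgentTurn:
    """Split one model output into its thought, search queries, and answer."""
    turn = AgentTurn(thought=_first("think", text), answer=_first("answer", text))
    search_block = _first("search", text)
    if search_block is not None:
        # Assumption: queries are nested inside the <search> block.
        found = re.findall(r"<query>(.*?)</query>", search_block, flags=re.DOTALL)
        turn.queries = [q.strip() for q in found if q.strip()]
    return turn


def wrap_learnings(snippets: List[str]) -> str:
    """Format environment feedback for the next round, one snippet per line."""
    return "<learnings>\n" + "\n".join(snippets) + "\n</learnings>"
```

Under these assumptions, a turn containing several <query> entries yields several items in `queries`, which matches the caption's note that open-ended problems produce multiple potential queries, and a turn with an <answer> tag ends the trajectory.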