AI-SearchPlanner: Modular Agentic Search via Pareto-Optimal Multi-Objective Reinforcement Learning

Lang Mei, Zhihan Yang, Xiaohan Yu, Huanyao Zhang, Chong Chen

TL;DR

AI-SearchPlanner introduces a decoupled, RL-based framework that assigns search planning to a trainable small LLM while delegating QA to a large frozen generator, enabling scalable and efficient end-to-end QA. It leverages a dual-reward scheme (outcome and process) and Pareto optimization to balance QA accuracy against planning cost, trained with PPO and loss masking to focus gradients on planning decisions. Empirical results across multiple QA benchmarks show strong gains over baselines, with demonstrated cross-domain transferability and a clear planning-effort vs. performance trade-off. The approach offers a practical pathway to deploying capable, resource-efficient AI search agents in real-world settings.

Abstract

Recent studies have explored integrating Large Language Models (LLMs) with search engines to leverage both the LLMs' internal pre-trained knowledge and external information. In particular, reinforcement learning (RL) has emerged as a promising paradigm for enhancing LLM reasoning through multi-turn interactions with search engines. However, existing RL-based search agents rely on a single LLM to handle both search planning and question-answering (QA) tasks in an end-to-end manner, which limits their ability to optimize both capabilities simultaneously. In practice, sophisticated AI search systems often employ a large, frozen LLM (e.g., GPT-4, DeepSeek-R1) to ensure high-quality QA. Thus, a more effective and efficient approach is to utilize a small, trainable LLM dedicated to search planning. In this paper, we propose AI-SearchPlanner, a novel reinforcement learning framework designed to enhance the performance of frozen QA models by focusing on search planning. Specifically, our approach introduces three key innovations: 1) Decoupling the Architecture of the Search Planner and Generator, 2) Dual-Reward Alignment for Search Planning, and 3) Pareto Optimization of Planning Utility and Cost. Extensive experiments on real-world datasets demonstrate that AI-SearchPlanner outperforms existing RL-based search agents in both effectiveness and efficiency, while exhibiting strong generalization capabilities across diverse frozen QA models and data domains.
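
The abstract does not give the exact reward formula, so the following is only a minimal sketch of one common way to trade planning utility against planning cost: a scalarized reward of the form utility minus $\alpha$ times cost. The names `Rollout`, `planning_reward`, `max_turns`, and the choice of search-turn count as the cost proxy are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    # All three fields are hypothetical stand-ins, not names from the paper.
    outcome_reward: float  # e.g. answer-quality gain attributed to planning
    process_reward: float  # e.g. score for how sensible the search trajectory was
    num_turns: int         # cost proxy: number of search turns used


def planning_reward(r: Rollout, alpha: float, max_turns: int = 5) -> float:
    """Scalarize utility and cost into one reward (assumed form: utility - alpha * cost)."""
    utility = r.outcome_reward + r.process_reward
    # Normalize cost to [0, 1] so alpha has a comparable scale across rollouts.
    cost = min(r.num_turns, max_turns) / max_turns
    return utility - alpha * cost
```

Under this assumed form, setting `alpha = 0` removes the cost term entirely, so the planner is rewarded for utility alone; larger values of `alpha` penalize longer search trajectories more heavily.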

Paper Structure

This paper contains 20 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of the AI-SearchPlanner framework.
  • Figure 2: Training dynamics of AI-SearchPlanner with cost coefficient $\alpha = 0$.
  • Figure 3: Utility-cost trade-offs on Wikipedia-based datasets. Blue points represent non-planning baselines. Orange points represent AI-SearchPlanner with different values of the cost coefficient $\alpha$.
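
A utility-cost plot like the one described for Figure 3 is usually read through its Pareto front: the set of configurations for which no other configuration is both cheaper and better. As a hedged illustration, the sketch below extracts that front from a list of (cost, utility) pairs. The function name `pareto_front` and the sample numbers are hypothetical and are not taken from the paper's results.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (cost, utility); lower cost and higher utility are better


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, ordered by increasing cost."""
    front: List[Point] = []
    best_utility = float("-inf")
    # Visit points from cheapest to most expensive; on cost ties, best utility first.
    for cost, utility in sorted(points, key=lambda p: (p[0], -p[1])):
        # Keep a point only if it beats every cheaper-or-equal point on utility.
        if utility > best_utility:
            front.append((cost, utility))
            best_utility = utility
    return front


# Made-up example values, e.g. one point per setting of the cost coefficient.
runs = [(1.0, 0.40), (2.0, 0.55), (2.5, 0.50), (3.0, 0.60), (3.0, 0.58)]
print(pareto_front(runs))
```

In this made-up example, the points at cost 2.5 and the second point at cost 3.0 are dropped because a cheaper or equally cheap configuration already achieves higher utility.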