StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization

Ziliang Wang; Xuhui Zheng; Kang An; Cijun Ouyang; Jialu Cai; Yuhang Wang; Yichao Wu

TL;DR

StepSearch addresses efficient multi-hop QA by introducing step-wise proximal policy optimization (StePPO), which uses token-level supervision and two reward channels to guide iterative retrieval. A MuSiQue-based data augmentation pipeline generates sub-question trajectories that provide rich, step-wise learning signals. Across four multi-hop QA benchmarks with 3B and 7B Qwen models, StepSearch achieves state-of-the-art results using only 19k training examples, and it converges faster and retrieves more faithfully than prior RL-based search methods. The approach offers a robust, data-efficient path to improving retrieval-augmented LLMs, with potential to extend to larger models and multimodal tasks.

Abstract

Efficient multi-hop reasoning requires Large Language Model (LLM)-based agents to acquire high-value external knowledge iteratively. Previous work has explored reinforcement learning (RL) to train LLMs to perform search-based document retrieval, achieving notable improvements in QA performance, but these methods underperform on complex, multi-hop QA because they rely on sparse rewards from a global signal only. To address this gap, we introduce StepSearch, a framework for search LLMs trained with a step-wise proximal policy optimization method. It combines richer, more detailed intermediate search rewards with token-level process supervision based on information gain and redundancy penalties to better guide each search step. We constructed a fine-grained question-answering dataset containing sub-question-level search trajectories, built from open-source datasets through a data pipeline. On standard multi-hop QA benchmarks, StepSearch significantly outperforms global-reward baselines, achieving 11.2% and 4.2% absolute improvements for 3B and 7B models over various search-with-RL baselines using only 19k training examples, demonstrating the effectiveness of fine-grained, step-wise supervision in optimizing deep search LLMs. Our code will be released at https://github.com/Zillwang/StepSearch.
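
To make the idea of a per-step reward built from information gain and a redundancy penalty more concrete, the following is a minimal toy sketch, not the StepSearch implementation. The abstract does not give the reward formulas, so everything here is an assumption made for illustration: the names `token_set`, `coverage`, and `step_rewards` are hypothetical, information gain is approximated as the improvement in token-overlap coverage of a set of gold documents, and redundancy is approximated as the share of documents in a step that were already retrieved earlier.

```python
from typing import List, Set


def token_set(text: str) -> Set[str]:
    """Lowercased whitespace tokens; a deliberately crude similarity basis."""
    return set(text.lower().split())


def coverage(doc: str, gold: str) -> float:
    """Fraction of the gold document's tokens that also appear in doc."""
    gold_tokens = token_set(gold)
    if not gold_tokens:
        return 0.0
    return len(gold_tokens & token_set(doc)) / len(gold_tokens)


def step_rewards(steps: List[List[str]], gold_docs: List[str]) -> List[float]:
    """Return one reward per search step: information gain minus redundancy.

    Hypothetical illustration only; the definitions below are assumptions,
    not the formulas used by StepSearch.
    """
    best = [0.0] * len(gold_docs)  # best coverage of each gold doc so far
    seen: Set[str] = set()         # documents retrieved in earlier steps
    rewards: List[float] = []
    for retrieved in steps:
        # Information gain: mean improvement in best coverage per gold doc.
        gain = 0.0
        for i, gold in enumerate(gold_docs):
            new_best = max([best[i]] + [coverage(d, gold) for d in retrieved])
            gain += new_best - best[i]
            best[i] = new_best
        gain /= max(len(gold_docs), 1)
        # Redundancy penalty: share of this step's documents already seen.
        repeats = sum(1 for d in retrieved if d in seen)
        penalty = repeats / len(retrieved) if retrieved else 0.0
        seen.update(retrieved)
        rewards.append(gain - penalty)
    return rewards


if __name__ == "__main__":
    gold = ["paris is the capital of france", "france borders spain"]
    trajectory = [
        ["paris is the capital of france"],  # new, useful evidence
        ["paris is the capital of france"],  # repeated retrieval
        ["france borders spain"],            # second hop found
    ]
    print(step_rewards(trajectory, gold))
```

In this toy trajectory the repeated second search earns a negative reward, while the first and third searches are rewarded for newly covered evidence. That contrast is the kind of dense, per-step signal a single global answer reward cannot provide.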
