Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

Ziliang Wang, Kang An, Xuhui Zheng, Faqiang Qian, Weikun Zhang, Cijun Ouyang, Jialu Cai, Yuhang Wang, Yichao Wu

TL;DR

The paper tackles brittleness in retrieval-augmented LLMs for multi-hop QA by identifying three core failure modes: decomposition errors, retrieval omissions, and reasoning mistakes. It introduces Erasable Reinforcement Learning (ERL), which detects faulty steps, erases them, and regenerates reasoning from the last correct state, turning fragile trajectories into resilient ones. The approach combines round-based reasoning with dense, multi-part rewards and specialized erasure operators, and reports SOTA performance on HotpotQA, MuSiQue, 2WikiMultiHopQA, and Bamboogle, with gains over the previous SOTA of +8.48% EM and +11.56% F1 at 3B and +5.38% EM and +7.22% F1 at 7B. ERL also outperforms classical RL baselines and shows strong gains in online retrieval settings, underscoring its potential to improve the robustness and reliability of complex, multi-step QA systems in real-world, dynamic information environments.

Abstract

While search-augmented large language models (LLMs) exhibit impressive capabilities, their reliability in complex multi-hop reasoning remains limited. This limitation arises from three fundamental challenges: decomposition errors, where tasks are incorrectly broken down; retrieval missing, where key evidence fails to be retrieved; and reasoning errors, where flawed logic propagates through the reasoning chain. A single failure in any of these stages can derail the final answer. We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process. ERL explicitly identifies faulty steps, erases them, and regenerates reasoning in place, preventing defective logic from propagating through the reasoning chain. This targeted correction mechanism turns brittle reasoning into a more resilient process. Models trained with ERL, termed ESearch, achieve substantial improvements on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with the 3B model achieving +8.48% EM and +11.56% F1, and the 7B model achieving +5.38% EM and +7.22% F1 over previous state-of-the-art (SOTA) results. These findings suggest that erasable reinforcement learning provides a powerful paradigm shift for robust multi-step reasoning in LLMs.
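
The identify, erase, and regenerate cycle described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `erase_and_regenerate`, `generate_step`, `is_faulty`, and `is_final`, as well as the retry and step limits, are hypothetical stand-ins for the policy model, the fault detector, and the stopping rule, none of which are specified on this page.

```python
from typing import Callable, List

Step = str  # assumption: a reasoning step is represented as plain text


def erase_and_regenerate(
    generate_step: Callable[[List[Step]], Step],    # hypothetical policy call
    is_faulty: Callable[[List[Step], Step], bool],  # hypothetical fault detector
    is_final: Callable[[Step], bool],               # hypothetical stopping rule
    max_steps: int = 8,
    max_retries: int = 3,
) -> List[Step]:
    """Build a trajectory, discarding faulty steps before they can propagate."""
    trajectory: List[Step] = []
    retries = 0
    while len(trajectory) < max_steps:
        step = generate_step(trajectory)
        if is_faulty(trajectory, step) and retries < max_retries:
            # Erase: the faulty step is never appended, so the next call
            # regenerates from the last accepted state.
            retries += 1
            continue
        # Accept the step (also when the retry budget is exhausted).
        trajectory.append(step)
        retries = 0
        if is_final(step):
            break
    return trajectory
```

The design point the sketch tries to capture is that a rejected step leaves no trace in the context used for the next generation, which is what keeps an early mistake from contaminating later rounds.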

Paper Structure

This paper contains 37 sections, 21 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Overview of ESearch. Different colors and symbols are used to represent the interactive behaviors S (Search), I (Information), O (Observation), and A (Sub Answer). In the answering process, there are three types of erasure and retry behaviors: (1) incorrect initial search results trigger initialization plan erasure; (2) incorrect subsequent search results trigger search design erasure; (3) incorrect sub-answer triggers observation erasure. In addition, history checking and target set are built for searches and sub-answers respectively to evaluate value gains, which serve as the basis for erasure triggers and reward calculation.
  • Figure 2: Training dynamics of different RL strategies. Compared to PPO, GRPO demonstrates faster learning speed and reward acquisition during training but tends to suffer from instability and potential collapse in the later stages of Search Agent tasks. In contrast, ERL achieves higher learning efficiency than PPO while maintaining training stability.
  • Figure 3: Training dynamics in ablation experiments.
  • Figure 4: Overview of ESearch. In the figure, 'o/w' (only with) indicates that only the current mechanism is added to the base method.
  • Figure 5: LLM interacts with external search engines and provides answers to prompt templates. The {question} will be replaced with the actual question content.
  • ...and 8 more figures
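
The three erasure triggers listed in the Figure 1 caption can be read as a simple dispatch rule. The sketch below is one possible reading, not the paper's code: `Erasure`, `choose_erasure`, and the boolean inputs `search_ok` and `sub_answer_ok` are hypothetical names, the checks that would produce those booleans (history checking and the target set) are not modeled, and checking the search result before the sub-answer is an assumption.

```python
from enum import Enum
from typing import Optional


class Erasure(Enum):
    INIT_PLAN = "initialization plan erasure"
    SEARCH_DESIGN = "search design erasure"
    OBSERVATION = "observation erasure"


def choose_erasure(
    round_index: int, search_ok: bool, sub_answer_ok: bool
) -> Optional[Erasure]:
    """Map a failure in one round to the erasure type named in the caption."""
    if not search_ok:
        # Incorrect initial search results -> initialization plan erasure;
        # incorrect subsequent search results -> search design erasure.
        return Erasure.INIT_PLAN if round_index == 0 else Erasure.SEARCH_DESIGN
    if not sub_answer_ok:
        # Incorrect sub-answer -> observation erasure.
        return Erasure.OBSERVATION
    return None  # nothing to erase in this round
```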