TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs

Yutao Xie; Nathaniel Thomas; Nicklas Hansen; Yang Fu; Li Erran Li; Xiaolong Wang

Abstract

Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training them remains challenging. Optimization is often unstable because rewards are sparse and credit assignment across reasoning steps and tool calls is difficult. To address this, we introduce Turn-Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn-level rewards to each reasoning + tool-call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging potential-based reward shaping, TIPS offers fine-grained, policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen-2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn-level information-potential reward shaping provides an effective and general solution to sparse-reward credit assignment for multi-turn LLM reasoning.
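To make the shaping idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the potential after each turn is the teacher's log-likelihood of the correct answer, and it uses the standard potential-based shaping form $\gamma\,\Phi(s') - \Phi(s)$. The function name `turn_level_rewards` and the example potentials are hypothetical.

```python
from typing import List


def turn_level_rewards(potentials: List[float], gamma: float = 1.0) -> List[float]:
    """Return one shaping reward per turn from a list of potentials.

    potentials[0] is the potential before any search turn and potentials[k]
    the potential after turn k. Assumption: each potential is a teacher
    log-likelihood of the correct answer given the context so far.
    """
    if len(potentials) < 2:
        return []
    # Standard potential-based shaping term: gamma * Phi(next) - Phi(previous).
    return [gamma * nxt - prev for prev, nxt in zip(potentials, potentials[1:])]


# Hypothetical potentials for a three-turn episode (values chosen for illustration).
phi = [-4.0, -2.5, -2.75, -0.5]
deltas = turn_level_rewards(phi)
print(deltas)       # one reward per turn; the second turn lowered the likelihood
print(sum(deltas))  # with gamma = 1 the terms telescope to phi[-1] - phi[0]
```

With `gamma = 1` the per-turn rewards sum to the difference between the final and initial potentials, so the shaping redistributes credit across turns without depending on the intermediate path.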

Paper Structure (58 sections, 48 equations, 12 figures, 16 tables)

Introduction
Preliminaries
Method
Turn-level information rewards
Segment-level PBRS and policy invariance
Experiments
Main results
Analysis
Ablations
Shaping scale $\alpha$.
Related Work
Conclusions
Datasets
General Question Answering
Natural Questions (NQ)
...and 43 more sections

Figures (12)

Figure 1: Overview of our training framework. The policy model interacts with the environment by conducting multi-turn conversations: each turn consists of reasoning, issuing a query, and receiving search results, until a final answer is produced. Two dense reward signals are then derived: (i) an outcome reward, obtained by verifying whether the final answer matches the ground truth; and (ii) an information reward, provided by a teacher model that measures the information gain each turn contributes toward the ground truth. Both rewards are combined and optimized with PPO.
Figure 2: Turn-level information reward pipeline. At each turn, retrieved evidence updates the answer likelihood, yielding a turn-level reward $\Delta_k$. These rewards are then injected at turn boundaries.
Figure 3: EM accuracy on the training set. TIPS converges to a high accuracy; PPO drifts late; GRPO collapses.
Figure 4: Training dynamics of PPO vs. TIPS. Blue curves denote PPO and orange curves denote TIPS with teacher refresh every 200 steps. Overall, TIPS climbs steadily to higher and more stable plateaus, while PPO often suffers mid-training drift or collapse, especially on multi-hop datasets.
Figure 5: Distribution of token-level advantages. Aggregated advantages at final checkpoints. TIPS yields a clean bimodal distribution with concentrated positive mass, while PPO shows heavy tails and dense near-zero mass.
...and 7 more figures
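
The captions for Figures 1 and 2 describe turn-level rewards $\Delta_k$ being injected at turn boundaries and combined with an outcome reward. A minimal sketch of one way this could look at the token level is given below. The placement of the outcome reward on the last token, the use of $\alpha$ as a multiplier on $\Delta_k$, and all names (`token_rewards`, `turn_ends`) are assumptions for illustration, not details confirmed by the text above.

```python
from typing import List


def token_rewards(
    num_tokens: int,
    turn_ends: List[int],
    deltas: List[float],
    outcome: float,
    alpha: float = 1.0,
) -> List[float]:
    """Build a per-token reward vector for one rollout.

    turn_ends[k] is the index of the last token of turn k (hypothetical layout).
    Each scaled turn reward is placed at its turn boundary, and the outcome
    reward is added on the final token; every other token gets zero.
    """
    if len(turn_ends) != len(deltas):
        raise ValueError("need exactly one delta per turn boundary")
    rewards = [0.0] * num_tokens
    for end, delta in zip(turn_ends, deltas):
        rewards[end] += alpha * delta
    rewards[-1] += outcome
    return rewards


# Hypothetical rollout: 6 tokens, two search turns ending at tokens 1 and 3.
example = token_rewards(6, turn_ends=[1, 3], deltas=[1.5, -0.5], outcome=1.0, alpha=0.5)
print(example)
```

A reward vector of this shape can be passed to a standard advantage estimator in place of an outcome-only vector, which is zero everywhere except the last token.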
