TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs

Yutao Xie, Nathaniel Thomas, Nicklas Hansen, Yang Fu, Li Erran Li, Xiaolong Wang

Abstract

Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training remains challenging. Optimization is often unstable because rewards are sparse and credit assignment across reasoning steps and tool calls is difficult. To address this, we introduce Turn-Level Information-Potential Reward Shaping (TIPS), a simple framework that assigns a dense, turn-level reward to each reasoning + tool-call segment based on the increase in the likelihood of the correct answer under a teacher model. By leveraging potential-based reward shaping, TIPS offers fine-grained, policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO and PPO baselines and substantially improves training stability. For instance, with a Qwen-2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn-level information-potential reward shaping provides an effective and general solution to sparse-reward credit assignment for multi-turn LLM reasoning.
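
The policy-invariance property mentioned in the abstract can be illustrated with the classical potential-based shaping form. The block below is a sketch under assumptions, not the paper's own equations: the symbols $\Phi$, $c_k$ (context after turn $k$), $a^\star$ (ground-truth answer), $p_T$ (teacher model), and the choice of the teacher log-likelihood as the potential are all illustrative.

```latex
% Assumed notation: c_k = context after turn k, a^\star = ground-truth answer,
% p_T = teacher model, \gamma = discount factor, K = number of turns.
\begin{align}
\Phi(c_k) &= \log p_T\!\left(a^\star \mid c_k\right)
  && \text{(potential: teacher log-likelihood of the answer)} \\
\Delta_k &= \gamma\,\Phi(c_k) - \Phi(c_{k-1})
  && \text{(turn-level shaping reward)} \\
\sum_{k=1}^{K} \Delta_k &= \Phi(c_K) - \Phi(c_0)
  && \text{(telescoping sum for } \gamma = 1\text{)}
\end{align}
```

Because the per-turn terms telescope, the shaped return differs from the outcome-only return only through the boundary potentials, which is the structure that the classical policy-invariance result for potential-based shaping relies on. How the paper handles the terminal potential is not stated in this excerpt.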

Paper Structure

This paper contains 58 sections, 48 equations, 12 figures, and 16 tables.

Figures (12)

  • Figure 1: Overview of our training framework. The policy model interacts with the environment by conducting multi-turn conversations: each turn consists of reasoning, issuing a query, and receiving search results, until a final answer is produced. Two dense reward signals are then derived: (i) an outcome reward, obtained by verifying whether the final answer matches the ground truth; and (ii) an information reward, provided by a teacher model that measures the information gain each turn contributes toward the ground truth. Both rewards are combined and optimized with PPO.
  • Figure 2: Turn-level information reward pipeline. At each turn, retrieved evidence updates the answer likelihood, yielding a turn-level reward $\Delta_k$. These rewards are then injected at turn boundaries.
  • Figure 3: EM accuracy on the training set. TIPS converges to a high accuracy; PPO drifts late; GRPO collapses.
  • Figure 4: Training dynamics of PPO vs. TIPS. Blue curves denote PPO and orange curves denote TIPS with teacher refresh every 200 steps. Overall, TIPS climbs steadily to higher and more stable plateaus, while PPO often suffers mid-training drift or collapse, especially on multi-hop datasets.
  • Figure 5: Distribution of token-level advantages. Aggregated advantages at final checkpoints. TIPS yields a clean bimodal distribution with concentrated positive mass, while PPO shows heavy tails and dense near-zero mass.
  • ...and 7 more figures
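
The pipeline described in the Figure 2 caption (a per-turn reward $\Delta_k$ from the change in answer likelihood, injected at turn boundaries) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function names, the use of precomputed teacher log-likelihoods as potentials, and placing the outcome reward on the last token are all hypothetical choices.

```python
from typing import List


def turn_level_rewards(potentials: List[float], gamma: float = 1.0) -> List[float]:
    """Compute Delta_k = gamma * Phi_k - Phi_{k-1} for each turn k >= 1.

    potentials[k] is assumed to be the teacher's log-likelihood of the
    ground-truth answer given the context after k turns (hypothetical input);
    potentials[0] is the value before any retrieval.
    """
    return [gamma * potentials[k] - potentials[k - 1]
            for k in range(1, len(potentials))]


def inject_at_turn_boundaries(num_tokens: int,
                              boundary_idx: List[int],
                              deltas: List[float],
                              outcome_reward: float) -> List[float]:
    """Build a per-token reward vector that is zero except at turn boundaries.

    boundary_idx[k] is the (assumed) index of the last token of turn k+1.
    The outcome reward is added on the final token of the rollout.
    """
    if len(boundary_idx) != len(deltas):
        raise ValueError("need exactly one boundary index per turn reward")
    rewards = [0.0] * num_tokens
    for idx, delta in zip(boundary_idx, deltas):
        rewards[idx] += delta
    rewards[-1] += outcome_reward
    return rewards


# Toy rollout: two search turns, answer log-likelihood rising after each one.
potentials = [-4.0, -2.5, -0.5]
deltas = turn_level_rewards(potentials)
token_rewards = inject_at_turn_boundaries(
    num_tokens=10, boundary_idx=[3, 7], deltas=deltas, outcome_reward=1.0)
```

In this toy rollout the two turn rewards sum to the total change in potential, `potentials[-1] - potentials[0]`, so the dense signal redistributes credit across turns without adding anything beyond the boundary potentials.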