ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models

Zihan Lin, Xiaohan Wang, Jie Cao, Jiajun Chai, Guojun Yin, Wei Lin, Ran He

TL;DR

This work tackles the high-variance, sparse-reward drawback of RL for tool-use in large language models by establishing a theoretical link between token-level entropy and training stability, showing that low-entropy, structured tokens drive rewards. It then introduces ResT, a token-level policy gradient reshaping method with entropy-informed per-token weights and a lightweight curriculum that progressively emphasizes reasoning tokens. The approach uses a rule-based reward combining format and tool-calling accuracy, and optimizes a PPO-style objective with per-token weights to reduce variance while maintaining unbiased learning. Empirical results on BFCL and API-Bank show state-of-the-art performance, with further gains when fine-tuned on larger LLMs, and ablations confirm the contributions of dynamic rewards, gradient shaping, and the curriculum.

Abstract

Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reinforcement learning (RL) offers a principled framework for optimizing these emergent tool-use policies, yet the prevailing paradigm relies exclusively on sparse outcome rewards and does not account for the particular characteristics of tool-use tasks, which inflates policy-gradient variance and makes training inefficient. To better understand and address these challenges, we first establish a theoretical link between policy entropy and training stability in tool-use tasks, which reveals that structured, low-entropy tokens are the primary determinants of rewards. Motivated by this insight, we propose Reshaped Token-level policy gradients (ResT) for tool-use tasks. ResT reshapes the policy gradient through entropy-informed token reweighting, progressively upweighting reasoning tokens as training proceeds. This entropy-aware scheme enables a smooth shift from structural correctness to semantic reasoning and stabilizes convergence in multi-turn tool-use tasks. Evaluation on BFCL and API-Bank shows that ResT achieves state-of-the-art results, outperforming prior methods by up to $8.76\%$. When fine-tuned on a 4B base LLM, ResT further surpasses GPT-4o by $4.11\%$ on single-turn tasks and $1.50\%$ on multi-turn base tasks. Code is available at https://github.com/1229095296/ResT_Tool_use_LLM.git.
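
The entropy-informed reweighting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the region labels, the linear schedule in `curriculum_weights`, the bounds `w_min` and `w_max`, and the clipping range `eps` are all assumptions chosen for illustration. It shows per-token entropy, a hypothetical curriculum that raises the weight of reasoning tokens as training progresses, and a PPO-style clipped surrogate in which each token's term is scaled by its weight.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def curriculum_weights(regions, progress, w_min=0.5, w_max=1.5):
    """Hypothetical schedule (an assumption, not taken from the paper):
    structured tokens keep weight 1.0, while reasoning tokens move
    linearly from w_min to w_max as progress goes from 0.0 to 1.0."""
    w_reason = w_min + (w_max - w_min) * progress
    return [w_reason if r == "reasoning" else 1.0 for r in regions]


def weighted_clipped_loss(logp_new, logp_old, advantage, weights, eps=0.2):
    """PPO-style clipped surrogate with a per-token weight,
    averaged over the tokens of one response."""
    total = 0.0
    for lp_new, lp_old, w in zip(logp_new, logp_old, weights):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += w * min(ratio * advantage, clipped * advantage)
    return -total / len(weights)
```

With `progress=0.0` the sketch emphasizes structured tokens (reasoning tokens get weight 0.5), and with `progress=1.0` reasoning tokens dominate (weight 1.5), which mirrors the shift from structural correctness to semantic reasoning only in spirit.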

Paper Structure

This paper contains 21 sections, 4 theorems, 60 equations, 4 figures, 6 tables, 2 algorithms.

Key Result

Lemma 1

Let $J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]$ denote the expected return. For a single trajectory $\tau_i=(y_{i,0:T-1})$, define the trajectory-level gradient $g_i$, where $\hat{A}_i$ is the advantage function. Given $G$ i.i.d. trajectories $\{\tau_i\}_{i=1}^G$ from $\pi_\theta$, the mini-batch estimator is $\widehat{\nabla J}=(1/G)\sum_{i=1}^G g_i$. Then for each coordinate $k$ the variance sati
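
The lemma statement is cut off here and is not reconstructed. Independently of how it continues, one textbook identity applies to any average of $G$ i.i.d. terms, and it explains why the per-trajectory variance is the quantity of interest. Writing $g_{i,k}$ for the $k$-th coordinate of $g_i$ (notation assumed here for illustration):

$$
\mathrm{Var}\big[\widehat{\nabla J}_k\big]
= \mathrm{Var}\Big[\frac{1}{G}\sum_{i=1}^{G} g_{i,k}\Big]
= \frac{1}{G^2}\sum_{i=1}^{G}\mathrm{Var}\big[g_{i,k}\big]
= \frac{1}{G}\,\mathrm{Var}\big[g_{1,k}\big].
$$

The second equality uses independence (the covariance terms vanish) and the third uses identical distribution. For a fixed $G$, the estimator's variance can therefore only be lowered by reducing $\mathrm{Var}[g_{1,k}]$ itself.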

Figures (4)

  • Figure 1: ResT decomposes multi-turn tool-use tasks into single-turn tasks and further reshapes the policy gradient according to the average entropy in different regions, enabling dense and effective reward signals.
  • Figure 2: Left: Overall accuracy on the API-Bank test set. Right: Overall accuracy on the BFCL test set. Axes are: NLST: Non-Live Single Turn, MTLC: Multi-Turn Long Context, MTB: Multi-Turn Base, MTMF: Multi-Turn w/ Missing Functions, LST: Live Single Turn, MTMP: Multi-Turn w/ Missing Parameters.
  • Figure 3: Learning curves for ResT and GRPO during training steps. The training dynamics show that ResT achieves a significantly lower and smoother policy entropy compared to GRPO, while maintaining comparable reward performance and longer responses.
  • Figure 4: Average region entropy on Llama-3.2-3B-Instruct (left) and Qwen3-1.7B (right). For each model, we randomly sample 10 instances and compute the entropy distribution over generated tokens. Tokens are partitioned into functional regions—reasoning, tool invocation, and final response.
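
The region-level statistic in Figure 4 can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes each generated token comes with its full next-token probability distribution and a region label, and the label strings (`"reasoning"`, `"tool_call"`, `"response"`) are hypothetical names for the three functional regions named in the caption.

```python
import math
from collections import defaultdict


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def average_region_entropy(token_probs, regions):
    """Average per-token entropy within each functional region.

    token_probs: one probability distribution per generated token.
    regions: one region label per generated token (same length).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for probs, region in zip(token_probs, regions):
        sums[region] += token_entropy(probs)
        counts[region] += 1
    return {region: sums[region] / counts[region] for region in sums}
```

A near-deterministic token, as is typical inside a structured tool call, contributes close to zero entropy, while a token chosen among many plausible continuations contributes more. Averaging within regions makes that contrast visible.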

Theorems & Definitions (4)

  • Lemma 1: Policy Gradient Variance Decomposition
  • Lemma 2: Second-Order Moment and Entropy Connection
  • Theorem 1: Variance Upper Bound for Entropy-Aware Reweighting
  • Theorem 2: Optimal Entropy-Aware Reweighting