Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation

Ziniu Li, Congliang Chen, Tianyun Yang, Tian Ding, Ruoyu Sun, Ge Zhang, Wenhao Huang, Zhi-Quan Luo

TL;DR

The paper addresses the high cost of exploration in RL for LLMs, where a uniform per-prompt rollout budget ignores heterogeneous task difficulty. It introduces Knapsack RL, which allocates exploration budgets across prompts by solving a knapsack optimization over task-budget pairs. Task Value is defined as the product of the probability of obtaining a non-zero gradient and an information-gain measure, approximated as ${\rm InfoGain}(p_i)\approx p_i(1-p_i)^2$, and the optimization is solved with dynamic programming under a total budget constraint. Empirically, Knapsack-GRPO raises the ratio of effective (non-zero) gradients by 20–40% and improves mathematical reasoning benchmarks by 2–4 points on average (up to 9 points on specific tasks); matching this with homogeneous allocation would require about 2x the compute. The approach is model- and task-agnostic within the GRPO framework, is compatible with existing training infrastructure, and leaves room for improved value functions and exploration strategies.
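
As a rough illustration of the allocation step summarized above, the sketch below treats each (prompt, budget) pair as a knapsack item and runs a dynamic program over the total rollout budget. It is not the authors' implementation. The names `task_value` and `allocate_budgets`, the `max_per_task` cap, the option of giving a prompt zero rollouts, and the assumption that rollouts are independent binary trials with a known success rate $p_i$ (so that the non-zero-gradient probability is $1 - p_i^N - (1-p_i)^N$) are all assumptions made for this example.

```python
def task_value(p, n):
    """Assumed value of giving n rollouts to a prompt with success rate p.

    Product of P(non-zero gradient) and the approximate InfoGain p(1-p)^2.
    """
    if n < 2:
        # With fewer than two rollouts all rewards are identical: no gradient.
        return 0.0
    prob_nonzero_grad = 1.0 - p**n - (1.0 - p) ** n
    info_gain = p * (1.0 - p) ** 2
    return prob_nonzero_grad * info_gain


def allocate_budgets(success_rates, total_budget, max_per_task):
    """Multiple-choice knapsack DP: one budget per task, total cost <= total_budget."""
    neg_inf = float("-inf")
    # best[b] = best total value using exactly b rollouts on the tasks seen so far
    best = [neg_inf] * (total_budget + 1)
    best[0] = 0.0
    choice = []  # choice[i][b] = rollouts given to task i on the best path to b
    for p in success_rates:
        new_best = [neg_inf] * (total_budget + 1)
        picked = [0] * (total_budget + 1)
        for b in range(total_budget + 1):
            if best[b] == neg_inf:
                continue
            for n in range(min(max_per_task, total_budget - b) + 1):
                v = best[b] + task_value(p, n)
                if v > new_best[b + n]:
                    new_best[b + n] = v
                    picked[b + n] = n
        best = new_best
        choice.append(picked)
    # Backtrack from the best reachable total budget.
    b = max(range(total_budget + 1), key=lambda k: best[k])
    allocation = [0] * len(success_rates)
    for i in range(len(success_rates) - 1, -1, -1):
        allocation[i] = choice[i][b]
        b -= allocation[i]
    return allocation


allocation = allocate_budgets([0.0, 1.0, 0.5, 0.1], total_budget=16, max_per_task=16)
```

With these inputs the sketch gives no rollouts to the two saturated prompts (success rates 0.0 and 1.0) and more rollouts to the harder prompt (0.1) than to the medium one (0.5), which mirrors the reallocation behavior the summary describes. In practice the success rates would have to be estimated, which this example does not model.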

Abstract

Large Language Models (LLMs) can self-improve through reinforcement learning, where they generate trajectories to explore and discover better solutions. However, this exploration process is computationally expensive, often forcing current methods to assign limited exploration budgets to each task. This uniform allocation creates problematic edge cases: easy tasks consistently succeed while difficult tasks consistently fail, both producing zero gradients during training updates for the widely used Group Relative Policy Optimization (GRPO). We address this problem from the lens of exploration budget allocation. Viewing each task's exploration as an "item" with a distinct "value" and "cost", we establish a connection to the classical knapsack problem. This formulation allows us to derive an optimal assignment rule that adaptively distributes resources based on the model's current learning status. When applied to GRPO, our method increases the effective ratio of non-zero policy gradients by 20-40% during training. Acting as a computational "free lunch", our approach could reallocate exploration budgets from tasks where learning is saturated to those where it is most impactful. This enables significantly larger budgets (e.g., 93 rollouts) for especially challenging problems, which would be computationally prohibitive under a uniform allocation. These improvements translate to meaningful gains on mathematical reasoning benchmarks, with average improvements of 2-4 points and peak gains of 9 points on specific tasks. Notably, achieving comparable performance with traditional homogeneous allocation would require about 2x the computational resources.
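
The zero-gradient edge case can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that a prompt's $N$ rollouts are independent with binary rewards and success rate $p$; under that assumption a group yields a non-zero GRPO gradient only if the rollouts are not all successes or all failures, which happens with probability $1 - p^N - (1-p)^N$. The function names and the `n_max` search cap are hypothetical.

```python
def prob_nonzero_gradient(p, n):
    """P(mixed outcomes among n independent rollouts with success rate p)."""
    return 1.0 - p**n - (1.0 - p) ** n


def min_rollouts_for(p, target, n_max=4096):
    """Smallest n in [2, n_max] reaching the target probability, else None."""
    for n in range(2, n_max + 1):
        if prob_nonzero_gradient(p, n) >= target:
            return n
    return None


# A fixed budget of 8 rollouts rarely helps on a very hard prompt (p = 0.01),
# while a medium prompt (p = 0.5) almost always produces a usable gradient.
hard = prob_nonzero_gradient(0.01, 8)
medium = prob_nonzero_gradient(0.5, 8)
```

Under this simplified model, prompts with success rates near 0 or 1 need far more rollouts than a uniform budget provides, which is the motivation for reallocating budget rather than increasing it everywhere.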

Paper Structure

This paper contains 27 sections, 2 theorems, 26 equations, 19 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Given a prompt $i$ with the success rate $p_i \in (0, 1)$, we have that

Figures (19)

  • Figure 1: Illustration of our framework for allocating exploration budgets among tasks from the available computational resources. We model each task as an item with a learning value and a computational cost, then solve the allocation problem using knapsack optimization.
  • Figure 2: The ratio of effective gradients and zero gradients during training.
  • Figure 3: Exploration budget required to ensure non-zero gradients, as a function of success rate. Note that success rates within the same bin are grouped from real samples and may not be symmetric, so the measured exploration budget may not be symmetric as the theory suggests.
  • Figure 4: The interplay between success rate, exploration budget and the value.
  • Figure 5: Distribution of exploration budgets allocated by Knapsack-GRPO for Qwen2.5-Math-7B during training.
  • ...and 14 more figures

Theorems & Definitions (5)

  • Definition 1: Success Rate
  • Theorem 1: Exploration Budget
  • Proposition 1
  • Proof of Theorem 1
  • Proof of Proposition 1