ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization

Chen Bo Calvin Zhang, Zhang-Wei Hong, Aldo Pacchiano, Pulkit Agrawal

TL;DR

This work tackles reward shaping in reinforcement learning by reframing shaping-reward design as an online model-selection problem. The authors introduce ORSO, which first generates a set of candidate shaping rewards and then uses a data-driven online selection strategy (D$^3$RB) to allocate training budget across these rewards while training corresponding policies, aiming to maximize the task reward $R$. They provide regret guarantees for ORSO under a monotonic-best-learner assumption and demonstrate substantial data- and compute-efficiency gains in continuous control tasks using Isaac Gym and PPO, including up to $8\times$ compute savings and, on average, more than $50\%$ higher task rewards than prior methods; ORSO often matches or surpasses manually engineered rewards with significantly less compute. The experimental results also show robustness to larger reward sets and that simpler exploration strategies can offer strong improvements, highlighting practical applicability for accelerating reward design in real-world RL systems. Overall, ORSO offers a principled, scalable framework for automatic shaping-reward selection with theoretical guarantees and compelling empirical performance gains.

Abstract

Reward shaping is critical in reinforcement learning (RL), particularly for complex tasks where sparse rewards can hinder learning. However, choosing effective shaping rewards from a set of reward functions in a computationally efficient manner remains an open challenge. We propose Online Reward Selection and Policy Optimization (ORSO), a novel approach that frames the selection of shaping reward function as an online model selection problem. ORSO automatically identifies performant shaping reward functions without human intervention with provable regret guarantees. We demonstrate ORSO's effectiveness across various continuous control tasks. Compared to prior approaches, ORSO significantly reduces the amount of data required to evaluate a shaping reward function, resulting in superior data efficiency and a significant reduction in computational time (up to 8 times). ORSO consistently identifies high-quality reward functions outperforming prior methods by more than 50% and on average identifies policies as performant as the ones learned using manually engineered reward functions by domain experts.
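The selection loop described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it swaps in a plain epsilon-greedy rule in place of D$^3$RB, and it replaces PPO training in Isaac Gym with a toy learning curve. The names `simulated_train_step`, `select_rewards`, `epsilon`, and the curve parameters are all hypothetical.

```python
import random


def simulated_train_step(reward_idx, n_steps):
    """Hypothetical stand-in for 'train the policy paired with candidate
    reward `reward_idx` for one more iteration, then evaluate it on the
    task reward'. Returns a toy task-reward value, not real training data."""
    # Assumed toy learning curves: each candidate saturates at its own level.
    ceilings = [0.3, 0.9, 0.6]   # final task reward each candidate can reach
    rates = [0.5, 0.1, 0.2]      # how quickly each candidate improves
    return ceilings[reward_idx] * (1.0 - (1.0 - rates[reward_idx]) ** n_steps)


def select_rewards(num_candidates, budget, epsilon=0.2, seed=0):
    """Allocate `budget` training iterations across candidate shaping rewards.

    Epsilon-greedy is used here purely for illustration; it is NOT D^3RB.
    """
    rng = random.Random(seed)
    steps = [0] * num_candidates          # iterations spent on each candidate
    task_reward = [0.0] * num_candidates  # latest task-reward estimate

    for t in range(budget):
        if t < num_candidates:
            i = t                                  # try every candidate once
        elif rng.random() < epsilon:
            i = rng.randrange(num_candidates)      # explore a random candidate
        else:
            # exploit the candidate whose policy currently scores best
            i = max(range(num_candidates), key=lambda j: task_reward[j])
        steps[i] += 1
        task_reward[i] = simulated_train_step(i, steps[i])

    best = max(range(num_candidates), key=lambda j: task_reward[j])
    return best, steps, task_reward


if __name__ == "__main__":
    for eps in (0.0, 0.2):
        best, steps, _ = select_rewards(3, budget=2000, epsilon=eps)
        print(f"epsilon={eps}: best candidate={best}, iterations per candidate={steps}")
```

With `epsilon=0.0` the loop is purely greedy and commits to the candidate that looks best after a single iteration, which in this toy setup is the one that plateaus lowest. A nonzero `epsilon` keeps spending some budget on the other candidates, which gives the slower-improving reward a chance to overtake.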

Paper Structure

This paper contains 50 sections, 3 theorems, 20 equations, 25 figures, 3 tables, 8 algorithms.

Key Result

Lemma 4.3

Under event $\mathcal{E}$ and the monotonic-best-learner assumption, with probability $1 - \delta$, the regret of all learners $i$ is bounded in all rounds $T$ as where $d_T^{i_\star} = d_{\left(n_T^{i_\star}\right)}^{i_\star}$.

Figures (25)

  • Figure 1: Comparison of three reward selection strategies given a fixed interaction budget. The green dashed line represents the task reward of the optimal policy, $\pi^\star$. The red and blue curves show the task rewards for policies trained with reward functions $f^1$ and $f^2$, respectively. The yellow curve, $\pi_t^\star$, tracks the maximum of the red and blue curves. Left: This selection strategy is overly exploitative, greedily selecting the reward function that seems to perform best early on but plateaus later in training. Center: This strategy, by contrast, continuously switches between $f^1$ and $f^2$, exploring the suboptimal reward function too much. Right: The ideal strategy initially explores both $f^1$ and $f^2$, but quickly latches onto the better reward function.
  • Figure 2: Left: Normalized task rewards averaged over interaction budgets and seeds. ORSO consistently matches or surpasses human-designed reward functions. Right: Normalized task reward as a function of interaction budget, averaged across tasks. ORSO scales effectively with increased budgets, achieving a 56% higher task reward than the naive strategy at the highest budget. Vertical bars in the plots indicate standard errors.
  • Figure 3: Median time to human-level performance as a function of number of parallel GPUs. Policies trained with ORSO can achieve the same performance as policies trained with the human-engineered reward functions with up to $8 \times$ fewer GPUs.
  • Figure 4: Comparison of different reward selection algorithms for ORSO. Left: Number of iterations necessary for human-level performance. Right: Average normalized task reward for different selection algorithms. We provide a more granular breakdown in the appendix.
  • Figure 5: Regret of different selection algorithms with varying budgets. We recall that a budget $B$ indicates that ORSO has been run for $B \times \texttt{n\_iters}$ iterations.
  • ...and 20 more figures

Theorems & Definitions (10)

  • Definition 3.1: Reward Design
  • Definition 4.1: Definition 2.1 from pacchiano2023data
  • Definition 4.3: Definition 8.1 from pacchiano2023data
  • Lemma 4.3
  • Remark 1
  • proof
  • Lemma D.1: Non-doubling regret coefficient
  • proof
  • Lemma D.1
  • proof