Reward-Conditioned Reinforcement Learning

Michal Nauman, Marek Cygan, Pieter Abbeel

TL;DR

Across single-task, multi-task, and vision-based benchmarks, RCRL not only improves performance under the nominal reward parameterization but also enables efficient adaptation to new parameterizations.

Abstract

RL agents are typically trained under a single, fixed reward function, which makes them brittle to reward misspecification and limits their ability to adapt to changing task preferences. We introduce Reward-Conditioned Reinforcement Learning (RCRL), a framework that trains a single agent to optimize a family of reward specifications while collecting experience under only one nominal objective. RCRL conditions the agent on reward parameterizations and learns multiple reward objectives from shared replay data entirely off-policy, enabling a single policy to represent reward-specific behaviors. Across single-task, multi-task, and vision-based benchmarks, we show that RCRL not only improves performance under the nominal reward parameterization, but also enables efficient adaptation to new parameterizations. Our results demonstrate that RCRL provides a scalable mechanism for learning robust, steerable policies without sacrificing the simplicity of single-task training.
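
The off-policy relabeling idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that the scalar reward is a weighted sum of stored reward components and that weights are drawn uniformly; the abstract does not specify the reward family or the sampling distribution, and the names `scalar_reward`, `sample_psi`, and `relabel_batch` are hypothetical.

```python
import random


def scalar_reward(components, psi):
    """Assumed reward family: weighted sum of the stored reward components."""
    return sum(w * c for w, c in zip(psi, components))


def sample_psi(rng, dim, low=0.0, high=1.0):
    """Assumed sampling distribution: independent uniform weights."""
    return [rng.uniform(low, high) for _ in range(dim)]


def relabel_batch(batch, rng):
    """Draw a fresh psi for every transition and recompute its scalar reward.

    The stored transitions are not modified; new dicts are returned.
    """
    relabeled = []
    for transition in batch:
        psi = sample_psi(rng, len(transition["reward_components"]))
        reward = scalar_reward(transition["reward_components"], psi)
        relabeled.append({**transition, "psi": psi, "reward": reward})
    return relabeled


rng = random.Random(0)
# Toy replay batch: data collected once, reward components kept separately.
batch = [
    {"state": [0.1, 0.2], "action": [0.5], "reward_components": [1.0, -0.3]},
    {"state": [0.4, 0.0], "action": [-0.2], "reward_components": [0.2, -0.1]},
]
relabeled = relabel_batch(batch, rng)
```

Because the reward is recomputed at training time, the same stored transition can serve many reward specifications without any additional environment interaction.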

Paper Structure (29 sections, 3 equations, 18 figures, 5 tables, 1 algorithm)

This paper contains 29 sections, 3 equations, 18 figures, 5 tables, 1 algorithm.

Figures (18)

  • Figure 1: Results summary. We find that the proposed RCRL framework improves performance under the nominal reward in both single-task and multi-task settings, and substantially improves performance when transferring to new reward functions. Furthermore, as shown in Figure \ref{fig:results_zeroshot}, RCRL enables zero-shot transfer and policy steerability capabilities that are absent in standard RL.
  • Figure 2: Overview of the proposed approach. (A) During environment interaction, actions are sampled from a policy conditioned on the nominal reward parameterization $\psi^{\star}$, and transitions together with reward components are stored in the replay buffer. (B) For each transition in the batch, a reward parameterization $\psi$ is independently drawn from the distribution $\mathcal{P}_{\Psi}$ and used to compute the corresponding scalar reward $r_{\psi}$. (C) Training proceeds as in the underlying base algorithm, with the modification that the environment state is concatenated to the sampled reward parameterization $\psi$ and provided as input to both the actor and critic.
  • Figure 3: Sample efficiency of RCRL when evaluated under the nominal reward function. Training curves for single-task (Top) and multi-task (Bottom) benchmarks. In the single-task setting, RCRL uses parameterized reward conditioning, and in the multi-task setting it uses auxiliary task conditioning. Across both regimes, RCRL improves sample efficiency compared to baseline algorithms.
  • Figure 4: Efficient transfer with RCRL. (Left) Zero-shot and finetuning performance. For the best-performing source-target task pairs, RCRL attains up to $40\%$ of optimal performance without any finetuning, and up to $90\%$ after $250$k finetuning steps. (Middle & Right) Heatmaps illustrating finetuning performance for all task pairs after $250$k environment steps. The middle panel compares SimbaV2+RCRL finetuning to SimbaV2 finetuning, while the right panel compares SimbaV2+RCRL finetuning to training SimbaV2 from scratch on the target task. The results show synergies for certain task pairs, alongside pairs where transfer is less effective.
  • Figure 5: Zero-shot policy adjustment with RCRL. Auxiliary reward functions promote different behaviors: running speed for cheetah, standing height for hopper, and action penalty for humanoid. We compare a vanilla single-task SimbaV2 agent, SimbaV2+RCRL, and a full multi-task BRC agent that both trains and explores under multiple reward functions. We present behavioral metrics as the policy is conditioned on different reward parameterizations (top row), and corresponding returns under each reward (bottom row). While the vanilla agent is unable to adjust its behavior without retraining, RCRL achieves behavior modulation comparable to full multi-task learning, despite learning alternative objectives fully off-policy without collecting additional data under those rewards.
  • ...and 13 more figures
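
Steps (A) and (C) of the Figure 2 caption, acting under the nominal parameterization and feeding the state concatenated with a reward parameterization to the networks, can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `PSI_STAR`, `toy_policy`, `toy_env_step`, and the two reward components are all hypothetical stand-ins.

```python
# Hypothetical nominal parameterization psi* (weights over two reward components).
PSI_STAR = [1.0, 0.0]


def conditioned_input(state, psi):
    """Concatenate the environment state with a reward parameterization."""
    return list(state) + list(psi)


def toy_policy(x):
    """Stand-in for the actor: any function of the concatenated input."""
    return [sum(x) / len(x)]


def toy_env_step(state, action):
    """Stand-in environment that returns reward components, not a scalar."""
    next_state = [s + 0.1 * action[0] for s in state]
    # Assumed components: a task-progress term and an action penalty.
    components = [next_state[0], -action[0] ** 2]
    return next_state, components


def collect_step(state, buffer):
    """Act under psi* and store the transition with its reward components."""
    action = toy_policy(conditioned_input(state, PSI_STAR))
    next_state, components = toy_env_step(state, action)
    buffer.append({
        "state": state,
        "action": action,
        "next_state": next_state,
        "reward_components": components,
    })
    return next_state


buffer = []
state = [0.0, 1.0]
for _ in range(3):
    state = collect_step(state, buffer)
```

Storing components rather than a single scalar is what lets a training step later compute $r_{\psi}$ for any sampled $\psi$; during training the same `conditioned_input` helper would be called with the sampled $\psi$ instead of `PSI_STAR`.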