Offline Reinforcement Learning with Imputed Rewards

Carlo Romeo, Andrew D. Bagdanov

TL;DR

The paper addresses offline reinforcement learning under severe reward sparsity by proposing a Reward Model that imputes missing rewards from a small labeled subset. The method trains a simple two-layer MLP on the 1% of transitions that carry reward labels and uses it to generate rewards for the remaining 99%, yielding a complete offline dataset for standard ORL algorithms. On D4RL MuJoCo locomotion tasks, this imputation approach significantly improves TD3BC and IQL performance compared with training on only the reward-labeled transitions, in some cases approaching the full-data baselines. The work reduces the need for abundant reward annotations and broadens the practical applicability of offline RL in real-world, data-scarce settings.

Abstract

Offline Reinforcement Learning (ORL) offers a robust solution to training agents in applications where interactions with the environment must be strictly limited due to cost, safety, or lack of accurate simulation environments. Despite its potential to facilitate deployment of artificial agents in the real world, Offline Reinforcement Learning typically requires very many demonstrations annotated with ground-truth rewards. Consequently, state-of-the-art ORL algorithms can be difficult or impossible to apply in data-scarce scenarios. In this paper we propose a simple but effective Reward Model that can estimate the reward signal from a very limited sample of environment transitions annotated with rewards. Once the reward signal is modeled, we use the Reward Model to impute rewards for a large sample of reward-free transitions, thus enabling the application of ORL techniques. We demonstrate the potential of our approach on several D4RL continuous locomotion tasks. Our results show that, using only 1% of reward-labeled transitions from the original datasets, our learned reward model is able to impute rewards for the remaining 99% of the transitions, from which performant agents can be learned using Offline Reinforcement Learning.
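
The imputation pipeline described above (fit a small reward model on the labeled 1%, then fill in rewards for the other 99%) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the input features (state concatenated with action), the hidden width, the mean-squared-error loss, the full-batch gradient descent, and all function names (`fit_reward_model`, `impute_rewards`, and so on) are assumptions made for the example.

```python
import numpy as np

def init_mlp(in_dim, hidden_dim, rng):
    """Parameters of a two-layer MLP: one hidden ReLU layer, one linear output."""
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / in_dim), (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, np.sqrt(1.0 / hidden_dim), (hidden_dim, 1)),
        "b2": np.zeros(1),
    }

def predict(params, x):
    """Predicted scalar reward for each row of x."""
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)
    return (h @ params["W2"] + params["b2"]).ravel()

def fit_reward_model(x, r, hidden_dim=64, lr=5e-3, epochs=500, seed=0):
    """Full-batch gradient descent on mean squared error (an assumed loss)."""
    rng = np.random.default_rng(seed)
    params = init_mlp(x.shape[1], hidden_dim, rng)
    n = x.shape[0]
    for _ in range(epochs):
        # Forward pass, keeping the intermediates needed for backpropagation.
        z = x @ params["W1"] + params["b1"]
        h = np.maximum(z, 0.0)
        pred = (h @ params["W2"] + params["b2"]).ravel()
        # Backward pass for L = mean((pred - r)^2).
        grad_pred = (2.0 / n) * (pred - r)[:, None]        # shape (n, 1)
        grad_z = (grad_pred @ params["W2"].T) * (z > 0.0)  # shape (n, hidden)
        params["W2"] -= lr * (h.T @ grad_pred)
        params["b2"] -= lr * grad_pred.sum(axis=0)
        params["W1"] -= lr * (x.T @ grad_z)
        params["b1"] -= lr * grad_z.sum(axis=0)
    return params

def impute_rewards(states, actions, rewards, labeled_mask, **fit_kwargs):
    """Keep the known rewards and fill the missing ones with model predictions."""
    x = np.concatenate([states, actions], axis=1)  # assumed model input: (s, a)
    params = fit_reward_model(x[labeled_mask], rewards[labeled_mask], **fit_kwargs)
    completed = rewards.copy()
    completed[~labeled_mask] = predict(params, x[~labeled_mask])
    return completed

# Synthetic stand-in for an offline dataset (not D4RL data).
rng = np.random.default_rng(0)
n = 2000
states = rng.normal(size=(n, 3))
actions = rng.normal(size=(n, 2))
true_rewards = states @ np.array([1.0, -0.5, 0.25]) - 0.1 * (actions ** 2).sum(axis=1)

# Keep reward labels for a random 1% of transitions; NaN marks a missing reward.
labeled_mask = np.zeros(n, dtype=bool)
labeled_mask[rng.choice(n, size=n // 100, replace=False)] = True
rewards = np.where(labeled_mask, true_rewards, np.nan)

completed = impute_rewards(states, actions, rewards, labeled_mask)
```

The resulting `completed` array has one reward per transition, so the full set of transitions could then be passed to an offline RL algorithm such as TD3BC or IQL in place of the sparsely labeled original.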

Paper Structure

This paper contains 11 sections, 5 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Illustration of the scenarios of interest. (a) Classical offline reinforcement learning solutions are trained on a large set of transitions in which the reward signal is fully defined. (b) In real applications, by contrast, the reward signal may be available for only a small fraction of the transitions. In this case, ORL algorithms can use only the transitions for which the reward signal is present and cannot exploit the full dataset, unless the reward signal is modeled from the reward-labeled transitions.
  • Figure 2: Visual comparison of TD3BC and IQL results as the fraction of reward-labeled transitions varies from 1% to 5%, for the Walker2D Medium Replay and HalfCheetah Medium Expert scenarios. TD3BC scores are shown in blue and IQL scores in orange. The solid lines indicate the baseline scores of each algorithm trained on the original dataset. The dashed lines show the performance of our solution: only 1% of reward-labeled transitions from the original dataset are considered, while the remaining 99% of transitions are labeled by imputing the reward signal via our Reward Model. The solid green lines represent the score achieved by running BC agents on the original dataset.