Active Reinforcement Learning Strategies for Offline Policy Improvement

Ambedkar Dukkipati, Ranga Shaarad Ayyagari, Bodhisattwa Dasgupta, Parag Dutta, Prabhas Reddy Onteru

TL;DR

The paper addresses sequential decision-making when online interaction is restricted. It introduces ActiveORL, which augments offline RL data with informative trajectories selected via representation-based epistemic uncertainty. The method combines a base offline RL algorithm with an active collection strategy that selects uncertain initial states and uses an uncertainty-driven exploration policy to gather diverse data within a budget; a two-stage variant handles settings where starting positions are restricted. Empirically, ActiveORL reduces online data requirements by up to 75% while improving performance across Maze2d, AntMaze, MuJoCo locomotion, CARLA, and IsaacSim-Go1, and ablations confirm that both active initial-state selection and uncertainty-based exploration contribute. The method is compatible with multiple offline algorithms (e.g., TD3+BC, IQL, CQL, BPPO) and shows strong data efficiency and generalization on varied continuous-control tasks with pruned or limited offline datasets.
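The initial-state selection step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that epistemic uncertainty is approximated by the disagreement (variance) among an ensemble of state encoders, which is one common proxy and may differ from the paper's estimator. The names `ensemble_uncertainty`, `select_initial_states`, and `make_toy_encoders` are hypothetical, and the random linear encoders are toy stand-ins for learned representation networks.

```python
import numpy as np


def ensemble_uncertainty(states, encoders):
    """Score each state by how much the encoders disagree on its embedding.

    Assumption: ensemble variance is used as a proxy for epistemic uncertainty.
    """
    embeddings = np.stack([enc(states) for enc in encoders])  # (E, N, D)
    return embeddings.var(axis=0).sum(axis=1)                 # (N,)


def select_initial_states(candidates, encoders, k):
    """Return the k candidate states with the highest uncertainty score."""
    scores = ensemble_uncertainty(candidates, encoders)
    top = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return candidates[top], scores[top]


def make_toy_encoders(n_encoders=5, state_dim=2, embed_dim=4, seed=0):
    """Hypothetical stand-in for trained encoders: random linear maps.

    Their outputs diverge more for states farther from the origin, which
    mimics higher uncertainty away from well-covered regions.
    """
    rng = np.random.default_rng(seed)
    weights = [rng.normal(size=(state_dim, embed_dim)) for _ in range(n_encoders)]
    return [lambda s, W=W: s @ W for W in weights]


if __name__ == "__main__":
    candidates = np.array([[0.1, 0.1], [1.0, 1.0], [3.0, 3.0]])
    chosen, scores = select_initial_states(candidates, make_toy_encoders(), k=2)
    print(chosen)
    print(scores)
```

In a full pipeline, the selected states would seed new episodes rolled out by the exploration policy until the interaction budget is spent; that loop is omitted here because it depends on the environment interface.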

Abstract

Learning agents that excel at sequential decision-making tasks must continuously resolve the problem of exploration and exploitation for optimal learning. However, such interactions with the environment online might be prohibitively expensive and may involve some constraints, such as a limited budget for agent-environment interactions and restricted exploration in certain regions of the state space. Examples include selecting candidates for medical trials and training agents in complex navigation environments. This problem necessitates the study of active reinforcement learning strategies that collect minimal additional experience trajectories by reusing existing offline data previously collected by some unknown behavior policy. In this work, we propose an active reinforcement learning method capable of collecting trajectories that can augment existing offline data. With extensive experimentation, we demonstrate that our proposed method reduces additional online interaction with the environment by up to 75% over competitive baselines across various continuous control environments such as Gym-MuJoCo locomotion environments as well as Maze2d, AntMaze, CARLA and IsaacSimGo1. To the best of our knowledge, this is the first work that addresses the active learning problem in the context of sequential decision-making and reinforcement learning.

Paper Structure

This paper contains 22 sections, 11 equations, 13 figures, 5 tables, 1 algorithm.

Figures (13)

  • Figure 1: [Best viewed in color] Consider an offline dataset as shown in (a). Our method computes uncertainties in various regions of the environment according to the dataset. As shown in (b), the uncertainties are high in regions where data is not present in the dataset. We collect new trajectories starting from the uncertain regions since that provides more information to the learning algorithm. As can be seen in (c), a simple online trajectory collection policy collects redundant trajectories, while our method focuses on previously unobserved regions, as evident from (d).
  • Figure 2: The figures display the terrains for the Unitree Go1 robot experiments in the NVIDIA Isaac Simulator. We name the three terrains (from left to right) go1-easy, go1-medium and go1-hard. The behavior policy was trained on the go1-easy terrain and achieves reasonably high rewards for the locomotion task on the flat surface, as shown. However, we assume that the environment has been modified and that the agent needs to update its policy as quickly as possible in the modified environment. If the agent uses its exploration budget efficiently, it can generalize from the experience gathered during active collection and obtain high rewards on the go1-hard terrain, despite having access only to the go1-medium terrain during active trajectory collection. The accompanying video in the supplementary materials demonstrates the advantage of using our active trajectory collection method.
  • Figure 3: [Best viewed in color] Results of our algorithm compared with the corresponding fine-tuning baseline. In the shaded plots, the results are averaged over multiple random seeds, with the shaded region denoting the standard deviation.
  • Figure 4: Plots corresponding to experiments where the agent is restricted to start from the original initial state distribution rather than the modified initial state distribution. We use a goal-based policy to reach a state close to the uncertain state and then switch over to our exploration policy.
  • Figure 5: The pruned maze2d D4RL datasets. The first image (on the left) corresponds to the maze2d-medium-v1 environment. We prune the dataset near the goal state to create maze2d-medium-easy-v1. The other two images correspond to versions of the maze2d-large-v1 environment, an easy version maze2d-large-easy-v1, and a hard version maze2d-large-hard-v1 respectively. The blue lines in the images correspond to the offline transitions remaining in the final dataset after pruning.
  • ...and 8 more figures