External Model Motivated Agents: Reinforcement Learning for Enhanced Environment Sampling

Rishav Bhagat, Jonathan Balloch, Zhiyu Lin, Julia Kim, Mark Riedl

TL;DR

This work tackles how reinforcement learning agents can aid the rapid adaptation of external models when environments shift, without altering task rewards. It introduces External Model Motivated Agent (EMMA), a reward-agnostic framework built from two modules: an Interest Field $f_{POI}: O \rightarrow \mathbb{R}$ defined over the observation space $O$, and a POI Influence mechanism that biases data collection toward high-$f_{POI}$ observations. Two exemplar implementations are proposed: Monte Carlo Dropout Disagreement to quantify uncertainty for $f_{POI}$ and Interest-Valued Discrete Skill Sampling (POI DIAYN) to bias a skill-conditioned policy toward informative regions, leveraging a VAE-based observation sampler. In experiments on the DoorKeyChange task from NovGrid, EMMA variants achieve faster external-model adaptation and better post-transfer performance than PPO and online DIAYN baselines, indicating reward-agnostic motivation can substantially improve external-model learning. The findings suggest broad applicability to other external models and motivate future work integrating EMMA with diverse domains and uncertainty estimators.
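As a rough illustration of the Monte Carlo Dropout Disagreement idea, the sketch below scores each observation by how much a model's predictions vary when dropout stays active at inference time. It is a minimal sketch under stated assumptions, not the paper's implementation: the two-layer network, the random weights `w1` and `w2`, the dropout rate, the number of passes, and the function name `mc_dropout_interest` are all hypothetical stand-ins for the external model and the interest field $f_{POI}$.

```python
import numpy as np

def mc_dropout_interest(obs, w1, w2, drop_p=0.2, n_passes=16, rng=None):
    """Return one interest value per observation (hypothetical f_POI sketch)."""
    rng = np.random.default_rng(0) if rng is None else rng
    preds = []
    for _ in range(n_passes):
        hidden = np.maximum(obs @ w1, 0.0)  # ReLU hidden layer
        # Keep dropout active at inference time (inverted dropout scaling).
        mask = rng.random(hidden.shape) >= drop_p
        hidden = hidden * mask / (1.0 - drop_p)
        preds.append(hidden @ w2)
    preds = np.stack(preds)  # shape: (n_passes, n_obs, out_dim)
    # Disagreement: variance across passes, averaged over output dimensions.
    return preds.var(axis=0).mean(axis=-1)

# Toy stand-in for an external model: random weights, 4-dimensional observations.
rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 32))
w2 = rng.normal(size=(32, 1))
obs = rng.normal(size=(5, 4))
interest = mc_dropout_interest(obs, w1, w2, rng=rng)
```

Observations with larger `interest` values are those on which the stochastic forward passes disagree most. A POI Influence mechanism would then bias data collection toward them.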

Abstract

Unlike reinforcement learning (RL) agents, humans remain capable multitaskers in changing environments. In spite of only experiencing the world through their own observations and interactions, people know how to balance focusing on tasks with learning about how changes may affect their understanding of the world. This is possible by choosing to solve tasks in ways that are interesting and generally informative beyond just the current task. Motivated by this, we propose an agent influence framework for RL agents to improve the adaptation efficiency of external models in changing environments without any changes to the agent's rewards. Our formulation is composed of two self-contained modules: interest fields and behavior shaping via interest fields. We implement an uncertainty-based interest field algorithm as well as a skill-sampling-based behavior-shaping algorithm to use in testing this framework. Our results show that our method outperforms the baselines in terms of external model adaptation on metrics that measure both efficiency and performance.

Paper Structure (36 sections, 6 equations, 8 figures, 3 algorithms)


Figures (8)

  • Figure 1: This system diagram outlines the per-episode process that shapes the behavior of a skill-based agent via the Monte Carlo dropout disagreement interest field. POI sampling occurs $s$ times at the beginning of each episode; once a skill is sampled, it is fixed for the full episode. After a rollout of data (spanning many episodes) is generated, the external model, the policy, and the other models introduced for our methods are updated using the new data.
  • Figure 2: These plots show the correct key distance external model losses over environment steps on both on-policy and random-agent rollouts. The steps directly after the transfer are shown to highlight the impact of our algorithms in adapting the external model. The plots show the IQM of external model loss for each method at each time step. Lower values are better, as they correspond to better external model performance. These graphs will not align perfectly with the results in Figure 3, as these graphs aggregate losses per step while the table aggregates calculated metrics.
  • Figure 3: This table shows the post-transfer metrics (defined in the paper's metrics section) for the interest-based methods and non-interest baselines on the experimental setup described. The IQM of the converged runs is used to aggregate the 10 seeds we ran. The adaptive efficiency values are the number of steps until the external model loss hits the convergence threshold, normalized by PPO performance. The adaptive performance values are the minimum external model loss post transfer, normalized by PPO performance. Lower values are better for both metrics.
  • Figure 4: This table shows the same metrics as in Figure 3, except now for the experiment with only 4 epochs per rollout.
  • Figure 5: This table shows the same metrics as in Figure 3, except now for the experiment with only a single epoch per rollout.
  • ...and 3 more figures
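
The two post-transfer metrics described in the table captions above can be sketched as follows. This is a hedged illustration rather than the authors' code. It assumes that the IQM trims 25% of runs from each end, that "steps until convergence" is the index of the first step with loss at or below the threshold, and that normalization means dividing by the corresponding PPO aggregate. The function names and the toy loss curves are hypothetical.

```python
import numpy as np

def iqm(values):
    """Interquartile mean: average after trimming 25% from each end."""
    x = np.sort(np.asarray(values, dtype=float))
    cut = int(0.25 * len(x))
    return x[cut:len(x) - cut].mean()

def steps_to_threshold(losses, threshold):
    """Index of the first step whose loss is at or below threshold, else None."""
    hits = np.flatnonzero(np.asarray(losses) <= threshold)
    return int(hits[0]) if hits.size else None

def post_transfer_metrics(runs, ppo_runs, threshold):
    """Return (adaptive efficiency, adaptive performance), normalized by PPO."""
    def raw(all_runs):
        steps = [steps_to_threshold(r, threshold) for r in all_runs]
        # Only runs that reach the threshold are aggregated.
        converged = [(s, min(r)) for s, r in zip(steps, all_runs) if s is not None]
        return iqm([s for s, _ in converged]), iqm([m for _, m in converged])
    (eff, perf), (ppo_eff, ppo_perf) = raw(runs), raw(ppo_runs)
    return eff / ppo_eff, perf / ppo_perf

# Toy post-transfer loss curves, one list per seed (made-up numbers).
ppo_runs = [[1.0, 0.8, 0.6, 0.4, 0.2]] * 4
emma_runs = [[1.0, 0.5, 0.3, 0.1, 0.1]] * 4
efficiency, performance = post_transfer_metrics(emma_runs, ppo_runs, threshold=0.4)
```

Values below 1 mean the method reached the threshold in fewer steps, or attained a lower minimum loss, than PPO. The sketch does not guard against a PPO aggregate of zero or against the case where no run converges.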