Meta-Gradient Search Control: A Method for Improving the Efficiency of Dyna-style Planning

Bradley Burega; John D. Martin; Luke Kapeluck; Michael Bowling

TL;DR

This work tackles sample efficiency in model-based RL under imperfect, non-stationary dynamics by learning where to query a model during Dyna-style planning. It introduces Meta Gradient Search Control (MGSC), which meta-learns a distribution over the states queried during planning by minimizing a meta-loss that measures how close the planning updates bring the value parameters to an optimal fixed point $\boldsymbol{\theta}^*$. The method uses a softmax parameterization $d(s;\boldsymbol{\eta})$ over states and backpropagates gradients through the planning updates (via the updated parameters $\bar{\boldsymbol{\theta}}(\boldsymbol{\eta})$), optimizing the search-control parameters $\boldsymbol{\eta}$ with Adam. Empirical results in TMaze and TwoRooms show that MGSC improves planning efficiency and downstream sample efficiency compared with fixed sampling baselines, both with learned models and when the model is imperfect, pointing to practical benefits for scalable model-based RL.
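To make the idea concrete, the following is a minimal tabular sketch of a meta-gradient step on a softmax search-control distribution. It is an illustration under stated assumptions, not the paper's implementation: the four-state chain, the expected (rather than sampled) planning update, the squared-distance meta-loss against a known target standing in for $\boldsymbol{\theta}^*$, and all function names are hypothetical. Plain gradient descent replaces Adam for brevity, and the gradient is written out by hand instead of using automatic differentiation.

```python
import math


def softmax(eta):
    """Search-control distribution d(s; eta) over states."""
    m = max(eta)
    exps = [math.exp(e - m) for e in eta]
    z = sum(exps)
    return [e / z for e in exps]


def td_errors(theta, rewards, next_state, gamma):
    """One-step TD error of each state under a deterministic model (assumed)."""
    return [rewards[s] + gamma * theta[next_state[s]] - theta[s]
            for s in range(len(theta))]


def planned_values(eta, theta, deltas, alpha):
    """theta_bar(eta): expected planning update, each state's TD step weighted by d(s; eta)."""
    d = softmax(eta)
    return [theta[s] + alpha * d[s] * deltas[s] for s in range(len(theta))]


def meta_loss(eta, theta, deltas, target, alpha):
    """Assumed meta-loss: half the squared distance from theta_bar(eta) to the target."""
    theta_bar = planned_values(eta, theta, deltas, alpha)
    return 0.5 * sum((tb - t) ** 2 for tb, t in zip(theta_bar, target))


def meta_grad(eta, theta, deltas, target, alpha):
    """Analytic gradient of meta_loss w.r.t. eta, through the planning update and the softmax."""
    n = len(eta)
    d = softmax(eta)
    theta_bar = planned_values(eta, theta, deltas, alpha)
    # g[s] is the derivative of the loss w.r.t. d[s].
    g = [(theta_bar[s] - target[s]) * alpha * deltas[s] for s in range(n)]
    baseline = sum(d[s] * g[s] for s in range(n))
    # Softmax Jacobian: d d[s] / d eta[k] = d[s] * (1[s == k] - d[k]).
    return [d[k] * (g[k] - baseline) for k in range(n)]


def run_demo(steps=50, meta_lr=1.0):
    """Hypothetical chain 0 -> 1 -> 2 -> 3; reward 1 on leaving state 2; state 3 absorbing."""
    rewards, next_state = [0.0, 0.0, 1.0, 0.0], [1, 2, 3, 3]
    gamma, alpha = 0.9, 0.5
    theta = [0.0, 0.0, 0.0, 0.0]     # initial value estimates, held fixed here
    target = [0.81, 0.9, 1.0, 0.0]   # true values of the chain, standing in for theta*
    deltas = td_errors(theta, rewards, next_state, gamma)
    eta = [0.0, 0.0, 0.0, 0.0]       # uniform search control to start
    losses = []
    for _ in range(steps):
        losses.append(meta_loss(eta, theta, deltas, target, alpha))
        grad = meta_grad(eta, theta, deltas, target, alpha)
        eta = [e - meta_lr * gk for e, gk in zip(eta, grad)]
    return softmax(eta), losses
```

In this toy setting only state 2 has a nonzero TD error at initialization, so the meta-gradient shifts probability mass toward it and away from states whose planning updates would change nothing. The value estimates are held fixed across meta-steps to isolate the search-control update; an actual Dyna agent would interleave this with value learning.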

Abstract

We study how a Reinforcement Learning (RL) system can remain sample-efficient when learning from an imperfect model of the environment. This is particularly challenging when the learning system is resource-constrained and in continual settings, where the environment dynamics change. To address these challenges, our paper introduces an online, meta-gradient algorithm that tunes a probability with which states are queried during Dyna-style planning. Our study compares the aggregate, empirical performance of this meta-gradient method to baselines that employ conventional sampling strategies. Results indicate that our method improves efficiency of the planning process, which, as a consequence, improves the sample-efficiency of the overall learning process. On the whole, we observe that our meta-learned solutions avoid several pathologies of conventional planning approaches, such as sampling inaccurate transitions and those that stall credit assignment. We believe these findings could prove useful, in future work, for designing model-based RL systems at scale.

Paper Structure (25 sections, 5 equations, 11 figures, 1 table, 1 algorithm)

Introduction
Problem Setting
Learning from a Model
Learning a Model
Querying a Model (Search Control)
Meta Gradient Search Control
The Meta-Loss
The Search Control Strategy
Meta Gradient Search Control in Dyna
Empirical Analysis
TMaze: Fixed Model
TMaze: Learned Model
Robustness to Imperfections
TwoRooms: Learned Model
Summary and Future Work
...and 10 more sections

Figures (11)

Figure 1: System diagram of training with Meta Gradient Search Control. The gray box denotes replication over the index $i$. The initial value parameters $\boldsymbol{\theta}$ are used for computing actions in the model $m$, the update operations, and in the MGSC loss.
Figure 2: TMaze Fixed Model Performance: (a) The total reward reflects the sample-efficiency of each learning algorithm. Error bars denote the 95% confidence interval over 30 seeds. (b) The average reward shows how learning performance varies through time and how each system copes with non-stationarity.
Figure 3: TMaze Fixed Model Solution: Evolution of MGSC's learned state distribution.
Figure 4: TMaze Learned Model Performance: (a) The total reward accumulated by each agent over the course of training. Error bars denote the 95% confidence interval. (b) The average reward accumulated during training for each agent.
Figure 5: TMaze Learned Model Solution: Evolution of MGSC's learned state distribution.
...and 6 more figures
