Meta-Gradient Search Control: A Method for Improving the Efficiency of Dyna-style Planning
Bradley Burega, John D. Martin, Luke Kapeluck, Michael Bowling
TL;DR
This work tackles sample efficiency in model-based RL under imperfect, non-stationary dynamics by learning where to query a model during Dyna-style planning. It introduces Meta-Gradient Search Control (MGSC), which meta-learns a distribution over the states queried during planning by minimizing a meta-loss that ties planning updates to proximity to an optimal fixed point $\boldsymbol{\theta}^*$. The method parameterizes this search-control distribution as a softmax $d(s;\boldsymbol{\eta})$ over states, backpropagates gradients through the planning updates (via $\bar{\boldsymbol{\theta}}(\boldsymbol{\eta})$), and updates $\boldsymbol{\eta}$ with Adam. Empirical results in TMaze and TwoRooms show that MGSC improves planning efficiency and downstream sample efficiency over fixed sampling baselines. The gains hold under model imperfections and with learned models, pointing to practical benefits for scalable model-based RL.
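The mechanism summarized above can be illustrated with a minimal tabular sketch. It is not the authors' implementation and rests on several assumptions that the summary does not confirm: the meta-loss is taken to be the squared distance $\tfrac{1}{2}\lVert\bar{\boldsymbol{\theta}}(\boldsymbol{\eta}) - \boldsymbol{\theta}^*\rVert^2$, the planning step is a single expected TD(0) update weighted by $d(s;\boldsymbol{\eta})$, $\boldsymbol{\theta}^*$ is computed in closed form for a toy three-state chain, and plain gradient descent stands in for Adam. All function and variable names are hypothetical.

```python
import math

def softmax(eta):
    """Search-control distribution d(s; eta) over tabular states."""
    m = max(eta)
    exps = [math.exp(e - m) for e in eta]
    z = sum(exps)
    return [e / z for e in exps]

def td_errors(theta, model, gamma):
    """One-step TD errors under the model; model[s] = (reward, next_state or None)."""
    errs = []
    for s, (r, s_next) in enumerate(model):
        bootstrap = 0.0 if s_next is None else theta[s_next]
        errs.append(r + gamma * bootstrap - theta[s])
    return errs

def planned_values(theta, d, delta, alpha):
    """Expected planning update: theta_bar[s] = theta[s] + alpha * d[s] * delta[s]."""
    return [theta[s] + alpha * d[s] * delta[s] for s in range(len(theta))]

def meta_loss(theta_bar, theta_star):
    """Assumed meta-loss: half the squared distance to the fixed point theta*."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(theta_bar, theta_star))

def meta_gradient(theta, theta_star, eta, model, gamma, alpha):
    """Gradient of the meta-loss w.r.t. eta, through the planning update.

    Uses the softmax Jacobian d d[s] / d eta[k] = d[s] * (1[s == k] - d[k]).
    """
    n = len(theta)
    d = softmax(eta)
    delta = td_errors(theta, model, gamma)
    theta_bar = planned_values(theta, d, delta, alpha)
    g = [(theta_bar[s] - theta_star[s]) * delta[s] for s in range(n)]
    baseline = sum(d[s] * g[s] for s in range(n))
    return [alpha * d[k] * (g[k] - baseline) for k in range(n)]

# Toy chain 0 -> 1 -> 2 -> terminal, with reward 1 on the final transition.
gamma, alpha, meta_lr = 0.9, 0.5, 1.0
model = [(0.0, 1), (0.0, 2), (1.0, None)]
theta = [0.0, 0.0, 0.0]                  # current value estimates (held fixed here)
theta_star = [gamma ** 2, gamma, 1.0]    # closed-form fixed point of the toy chain
eta = [0.0, 0.0, 0.0]                    # uniform search control to start

delta = td_errors(theta, model, gamma)
d_before = softmax(eta)
loss_before = meta_loss(planned_values(theta, d_before, delta, alpha), theta_star)

for _ in range(200):
    grad = meta_gradient(theta, theta_star, eta, model, gamma, alpha)
    eta = [e - meta_lr * g for e, g in zip(eta, grad)]  # gradient descent, not Adam

d_after = softmax(eta)
loss_after = meta_loss(planned_values(theta, d_after, delta, alpha), theta_star)
```

In this toy problem only the last state has a nonzero TD error at initialization, so the meta-gradient shifts probability mass toward it. This is one simple way to see how learned search control could avoid queries that stall credit assignment.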
Abstract
We study how a Reinforcement Learning (RL) system can remain sample-efficient when learning from an imperfect model of the environment. This is particularly challenging when the learning system is resource-constrained and when it operates in continual settings, where the environment dynamics change. To address these challenges, our paper introduces an online, meta-gradient algorithm that tunes the probability with which states are queried during Dyna-style planning. Our study compares the aggregate empirical performance of this meta-gradient method to baselines that employ conventional sampling strategies. Results indicate that our method improves the efficiency of the planning process, which in turn improves the sample efficiency of the overall learning process. On the whole, we observe that our meta-learned solutions avoid several pathologies of conventional planning approaches, such as sampling inaccurate transitions and sampling transitions that stall credit assignment. We believe these findings could prove useful in future work on designing model-based RL systems at scale.
