SplAgger: Split Aggregation for Meta-Reinforcement Learning

Jacob Beck, Matthew Jackson, Risto Vuorio, Zheng Xiong, Shimon Whiteson

TL;DR

This paper investigates whether permutation-invariant sequence models remain advantageous for meta-reinforcement learning when trained end-to-end, and finds that permutation-variant components can also improve performance in certain domains. It introduces SplAgger, a Split Aggregator that combines permutation-invariant aggregation with permutation-variant encoding, while discarding AMRL’s gradient modification that can cause gradient explosions. Empirical evaluations across MuJoCo and memory-based tasks show SplAgger delivering the best overall returns and sample efficiency, with ablations clarifying the roles of the invariant/variant components and the gradient design. The work provides practical guidance on aggregation design for end-to-end meta-RL and contributes a reusable end-to-end architecture with public code.

Abstract

A core ambition of reinforcement learning (RL) is the creation of agents capable of rapid learning in novel tasks. Meta-RL aims to achieve this by directly learning such agents. Black box methods do so by training off-the-shelf sequence models end-to-end. By contrast, task inference methods explicitly infer a posterior distribution over the unknown task, typically using distinct objectives and sequence models designed to enable task inference. Recent work has shown that task inference methods are not necessary for strong performance. However, it remains unclear whether task inference sequence models are beneficial even when task inference objectives are not. In this paper, we present evidence that task inference sequence models are indeed still beneficial. In particular, we investigate sequence models with permutation invariant aggregation, which exploit the fact that, due to the Markov property, the task posterior does not depend on the order of data. We empirically confirm the advantage of permutation invariant sequence models without the use of task inference objectives. However, we also find, surprisingly, that there are multiple conditions under which permutation variance remains useful. Therefore, we propose SplAgger, which uses both permutation variant and invariant components to achieve the best of both worlds, outperforming all baselines evaluated on continuous control and memory environments. Code is provided at https://github.com/jacooba/hyper.
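The split idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes per-transition encodings are already available, uses a max over time as the permutation-invariant half, and uses a plain tanh recurrence as a stand-in for the permutation-variant half. The names `split_aggregate` and `w_rec`, the dimensions, and the random inputs are all hypothetical.

```python
import numpy as np

def split_aggregate(encodings, w_rec):
    """Summarize a (T, D) array of per-step encodings into one (D,) vector.

    Illustrative assumption: the first half of the features is aggregated
    with a max over time (order-independent), and the second half is fed
    through a simple tanh recurrence (order-dependent).
    """
    half = encodings.shape[1] // 2
    # Permutation-invariant half: max over the time axis.
    invariant = encodings[:, :half].max(axis=0)
    # Permutation-variant half: a minimal recurrent update, step by step.
    h = np.zeros(encodings.shape[1] - half)
    for e in encodings[:, half:]:
        h = np.tanh(w_rec @ h + e)
    return np.concatenate([invariant, h])

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))            # 5 hypothetical transitions, 8 features
w_rec = 0.5 * rng.normal(size=(4, 4))    # hypothetical recurrent weights

out = split_aggregate(enc, w_rec)
out_rev = split_aggregate(enc[::-1], w_rec)  # same data, reversed order

print(np.allclose(out[:4], out_rev[:4]))  # invariant half ignores order
print(np.allclose(out[4:], out_rev[4:]))  # recurrent half depends on order
```

Reversing the input leaves the max-aggregated half unchanged while the recurrent half changes, which is the distinction between permutation-invariant and permutation-variant components that the abstract draws.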

Paper Structure (42 sections, 7 equations, 14 figures)

Figures (14)

  • Figure 1: The hypernetwork from \cite{beck2023recurrent} is depicted in \ref{fig:hyper}, the AMRL model from \cite{beck2020AMRL} is depicted in \ref{fig:amrl}, and SplAgger is depicted in \ref{fig:splag}. The angled line indicates a split connection that divides the neurons in half. The red arrow indicates a modified gradient computation in the backward pass. A hypernetwork is indicated by $h$. SplAgger combines the hypernetwork architecture with the AMRL sequence model. The hypernetwork architecture is necessary for performant end-to-end training. Critically, SplAgger also removes the gradient modification from AMRL, which we show to be deleterious to performance.
  • Figure 2: A preview of later results. The permutation invariance of the max aggregator improves returns relative to the RNN on the MC-LS environment \cite{beck2020AMRL}, but decreases returns on the Planning Game \cite{ritter2021rapid}. Additionally, the ST gradient decreases the returns of max aggregation. These results motivate SplAgger, which achieves the highest returns. (Results are reported with a 68% confidence interval, computed through bootstrapping with 1,000 iterations across three seeds, consistent with all plots presented.)
  • Figure 3: Depictions of the Planning Game, T-LS, T-Maze Agreement, and T-Maze Latent environments used in Sections \ref{sec:memexp} and \ref{sec:abl}.
  • Figure 4: Results on MuJoCo benchmarks. SplAgger achieves the same or better results on both domains. PEARL achieves significantly lower return on Cheetah-Dir.
  • Figure 5: Results on memory benchmarks. SplAgger achieves the highest returns on both domains, indicating the fastest learning. The standard RNN is not able to learn on either domain within the allotted number of frames.
  • ...and 9 more figures
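
The Figure 2 caption reports 68% confidence intervals obtained by bootstrapping with 1,000 iterations across three seeds. A minimal sketch of one common way to compute such an interval follows. It assumes the percentile bootstrap over per-seed mean returns, which the source does not specify, and the function name `bootstrap_ci` and the example returns are hypothetical.

```python
import numpy as np

def bootstrap_ci(per_seed_returns, n_boot=1000, level=0.68, seed=0):
    """Percentile-bootstrap confidence interval for the mean across seeds.

    Assumption: each seed contributes one return value, and seeds are
    resampled with replacement n_boot times.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(per_seed_returns, dtype=float)
    # Each row is one resample of the seeds, drawn with replacement.
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    means = x[idx].mean(axis=1)
    # A central 68% interval spans the 16th to the 84th percentile.
    lower, upper = np.percentile(means, [50 * (1 - level), 50 * (1 + level)])
    return lower, upper

returns = [1.0, 2.0, 3.0]  # hypothetical returns from three seeds
lower, upper = bootstrap_ci(returns)
print(lower, upper)
```

With only three seeds the resampled means take few distinct values, so intervals computed this way are coarse. Treat this as an illustration of the procedure rather than a reproduction of the paper's plots.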