TIMRL: A Novel Meta-Reinforcement Learning Framework for Non-Stationary and Multi-Task Environments

Chenyang Qi; Huiping Li; Panfeng Huang


TL;DR

TIMRL addresses the challenge of sample-inefficient meta-RL in non-stationary and multi-task environments. It introduces a Gaussian Mixture Model with $K$ components to encode multiple task classes and a transformer-based recognition network to select the corresponding Gaussian component and produce a latent code $z$ that conditions the policy. The training objective decouples task inference from policy learning and combines a VAE-style loss with a reconstruction term $\mathcal{L}_{recons}$ and a regularization term $\mathcal{L}_{regula}$, along with a supervised recognition loss. The policy is learned with SAC conditioned on $z$, enabling off-policy, efficient learning. On extended MuJoCo benchmarks, TIMRL demonstrates superior sample efficiency, accurate task classification (recognition accuracy approaching $95\%$), and strong asymptotic performance in non-stationary and multi-task settings.
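To make the component-selection step concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the hard argmax selection, the reparameterized sampling, and the toy parameter values are all hypothetical, and the `logits` list merely stands in for the output of the transformer-based recognition network.

```python
import math
import random


def select_component(logits):
    """Return the index of the highest-scoring Gaussian component."""
    return max(range(len(logits)), key=lambda k: logits[k])


def sample_latent(means, log_vars, logits, rng):
    """Pick one component from the recognition scores, then draw z from it.

    means, log_vars: K lists of length D (hypothetical per-component parameters).
    logits: K recognition scores (stand-in for the recognition network output).
    """
    k = select_component(logits)
    # Assumed reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1).
    z = [
        mu + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for mu, lv in zip(means[k], log_vars[k])
    ]
    return k, z


rng = random.Random(0)
means = [[-2.0, 0.0], [2.0, 0.0], [0.0, 3.0]]        # K = 3 components, D = 2
log_vars = [[-4.0, -4.0], [-4.0, -4.0], [-4.0, -4.0]]  # small variances
logits = [0.1, 2.5, -1.0]                              # scores favor component 1
k, z = sample_latent(means, log_vars, logits, rng)
print(k, z)
```

In a full system the sampled `z` would be concatenated with the state and passed to the SAC policy; that wiring is omitted here.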

Abstract

In recent years, meta-reinforcement learning (meta-RL) algorithms have been proposed to improve sample efficiency in decision-making and control, enabling agents to acquire new knowledge from a small number of samples. However, most existing work uses a single Gaussian distribution to extract task representations, which adapts poorly to tasks that change in non-stationary environments. To address this problem, we propose a novel meta-RL method that uses a Gaussian mixture model and a transformer network to construct the task inference model. The Gaussian mixture model extends the task representation and encodes tasks explicitly. Specifically, a transformer network classifies each task to determine the Gaussian component corresponding to it. The transformer network is trained with supervised learning on task labels. We validate our method on MuJoCo benchmarks with non-stationary and multi-task environments. Experimental results show that the proposed method substantially improves sample efficiency and accurately recognizes the task class, while achieving strong performance in these environments.
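The supervised training of the recognition network on task labels can be sketched as a standard classification problem. The snippet below is a hedged illustration only: it assumes a softmax cross-entropy loss, which the text above does not specify, and the function names and toy scores are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def recognition_loss(batch_logits, labels):
    """Mean cross-entropy between predicted class scores and task labels."""
    losses = [-math.log(softmax(lg)[y]) for lg, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


def recognition_accuracy(batch_logits, labels):
    """Fraction of samples whose argmax class matches the task label."""
    hits = sum(
        1
        for lg, y in zip(batch_logits, labels)
        if max(range(len(lg)), key=lambda k: lg[k]) == y
    )
    return hits / len(labels)


# Toy batch: three contexts, three task classes (values are made up).
batch_logits = [[2.0, 0.1, -1.0], [0.0, 3.0, 0.5], [1.0, 0.2, 0.9]]
labels = [0, 1, 2]
loss = recognition_loss(batch_logits, labels)
acc = recognition_accuracy(batch_logits, labels)
print(loss, acc)
```

The accuracy helper mirrors the kind of per-epoch recognition accuracy the paper reports, computed here on made-up scores rather than real data.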


Paper Structure (16 sections, 16 equations, 8 figures, 1 algorithm)


Introduction
Related Work
Meta Reinforcement Learning
Task Inference and Task Embedding for Meta-RL
Preliminaries
Context-based Meta-RL
Gaussian Mixture Model
Transformer Network
Task Inference in Meta-RL
Task Inference Model
Training Model
TIMRL
Experiments
Experiments Setup
Performance
...and 1 more section

Figures (8)

Figure 1: We use a GMM-based task inference model, where the Gaussian component corresponding to the task is selected by the recognition network and the task embedding $z$ is extracted from the Gaussian distribution.
Figure 2: TIMRL primarily consists of a task inference model (encoder), a decoder, and a policy network (SAC). We use a GMM to construct the task inference model and design a recognition network to classify tasks. Based on the classification result, the corresponding Gaussian component is selected to generate the task embedding $z$. The policy network is trained conditioned on the task embedding. The dashed arrows indicate the gradient backpropagation process.
Figure 3: Four sets of environments for algorithm evaluation. From left to right, the environments are Cheetah-Nonstat-Dir, Cheetah-Nonstat-Vel, Cheetah-Nonstat-Flipping, and Ant-Nonstat-Dir.
Figure 4: Reward curves on the meta-testing tasks during training in the non-stationary environments. The curves show each algorithm's asymptotic performance and sample efficiency on the benchmark tasks. The solid lines represent the asymptotic performance of each algorithm. Non-stationary environments: Cheetah-Nonstat-Dir, Cheetah-Nonstat-Vel, Cheetah-Nonstat-Flipping, Ant-Nonstat-Dir.
Figure 5: Task recognition accuracy per epoch of the recognition network when trained in the non-stationary environments corresponding to Fig. 4.
...and 3 more figures
