Data-Efficient Task Generalization via Probabilistic Model-based Meta Reinforcement Learning

Arjun Bhardwaj; Jonas Rothfuss; Bhavya Sukhija; Yarden As; Marco Hutter; Stelian Coros; Andreas Krause

TL;DR

The paper tackles data-efficient task generalization in reinforcement learning by introducing PACOH-RL, which meta-learns a Bayesian neural network prior over dynamics using PACOH-NN and SVGD. On a new target task, it combines the prior with limited observed data to form posterior dynamics and employs uncertainty-aware optimistic exploration (H-UCRL) to guide data collection, enabling rapid adaptation. PACOH-RL supports both an iCEM-MPC planning variant and a SAC-based policy variant, with a Greedy baseline for ablations, and demonstrates superior sample efficiency and transfer to a real RC car under sparse rewards. The results show meaningful improvements over model-based RL baselines and model-free approaches, highlighting the practical potential of principled uncertainty quantification for fast, data-light policy adaptation in robotics.
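To make the SVGD ingredient concrete, here is a minimal sketch of the generic Stein Variational Gradient Descent update on a toy one-dimensional target. It is not the PACOH-NN meta-learning procedure: the Gaussian target, the fixed RBF bandwidth, the step size, and the names `svgd_step` and `grad_log_gaussian` are all assumptions chosen for illustration. Each particle is pulled toward high-density regions by the kernel-weighted score and pushed away from its neighbours by the kernel gradient, which is what keeps the particle set spread out rather than collapsing to a single point.

```python
import math


def svgd_step(particles, grad_log_p, step_size=0.1, bandwidth=1.0):
    """One SVGD update for scalar particles with a fixed-bandwidth RBF kernel."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            # RBF kernel value k(xj, xi)
            k = math.exp(-((xj - xi) ** 2) / (2.0 * bandwidth ** 2))
            # Derivative of k(xj, xi) with respect to xj (repulsive term)
            grad_k = -(xj - xi) / bandwidth ** 2 * k
            # Attraction (kernel-weighted score) plus repulsion
            phi += k * grad_log_p(xj) + grad_k
        updated.append(xi + step_size * phi / n)
    return updated


def grad_log_gaussian(x, mean=2.0, std=0.5):
    """Score function of a toy Gaussian target (an illustrative assumption)."""
    return -(x - mean) / std ** 2


# Hypothetical initial particles, deliberately far from the target mean
particles = [-1.0, -0.5, 0.0, 0.5, 1.0]
for _ in range(500):
    particles = svgd_step(particles, grad_log_gaussian)

mean_estimate = sum(particles) / len(particles)
print("particles:", [round(p, 3) for p in particles])
print("mean estimate:", round(mean_estimate, 3))
```

In a BNN setting each particle would be a full weight vector and the score would come from the (meta-learned) prior plus the data likelihood; the scalar case above only illustrates the update rule.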

Abstract

We introduce PACOH-RL, a novel model-based Meta-Reinforcement Learning (Meta-RL) algorithm designed to efficiently adapt control policies to changing dynamics. PACOH-RL meta-learns priors for the dynamics model, allowing swift adaptation to new dynamics with minimal interaction data. Existing Meta-RL methods require abundant meta-learning data, limiting their applicability in settings such as robotics, where data is costly to obtain. To address this, PACOH-RL incorporates regularization and epistemic uncertainty quantification in both the meta-learning and task adaptation stages. When facing new dynamics, we use these uncertainty estimates to effectively guide exploration and data collection. Overall, this enables positive transfer, even when access to data from prior tasks or dynamic settings is severely limited. Our experimental results demonstrate that PACOH-RL outperforms model-based RL and model-based Meta-RL baselines in adapting to new dynamic conditions. Finally, on a real robotic car, we showcase the potential for efficient RL policy adaptation in diverse, data-scarce conditions.
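The idea of using epistemic uncertainty to guide exploration can be illustrated with a toy, one-step sketch in the spirit of optimistic planning as in H-UCRL: the planner may pick any next state inside the confidence interval `mean ± beta * std` of the model and chooses the most favourable one. Everything below is an assumption for illustration, including the scalar linear ensemble, the reward, `beta`, and the function names; the actual method plans over a horizon with a BNN dynamics model.

```python
import statistics


def ensemble_prediction(models, state, action):
    """Mean and standard deviation of next-state predictions across the ensemble."""
    preds = [a * state + b * action for (a, b) in models]
    return statistics.mean(preds), statistics.pstdev(preds)


def optimistic_next_state(mean, std, goal, beta=1.0):
    """Pick the next state in [mean - beta*std, mean + beta*std] closest to the goal."""
    radius = beta * std
    if radius == 0.0:
        return mean
    # eta in [-1, 1] selects a point inside the confidence interval
    eta = max(-1.0, min(1.0, (goal - mean) / radius))
    return mean + radius * eta


def reward(next_state, goal):
    """Toy reward: negative distance to the goal."""
    return -abs(next_state - goal)


# Hypothetical ensemble of scalar linear models s' = a * s + b * u
models = [(0.9, 0.5), (1.0, 0.8), (1.1, 1.1)]
state, goal = 1.0, 2.0
candidate_actions = [0.0, 0.5, 1.0]

greedy_scores = {}
optimistic_scores = {}
for u in candidate_actions:
    mean, std = ensemble_prediction(models, state, u)
    greedy_scores[u] = reward(mean, goal)  # trusts the mean prediction only
    optimistic_scores[u] = reward(optimistic_next_state(mean, std, goal), goal)

print("greedy:", greedy_scores)
print("optimistic:", optimistic_scores)
```

Because `eta = 0` recovers the mean prediction, the optimistic score is never lower than the greedy one, and the gap is largest for actions where the ensemble disagrees most. This is what steers data collection toward poorly understood regions of the dynamics.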

Paper Structure (27 sections, 18 equations, 10 figures, 5 tables)

Introduction
Related Work
Background
PACOH-RL: Uncertainty-Aware Model-Based Meta-RL
Problem Statement: Meta-RL
Our approach: Model-Based Meta-RL
Experiments
Simulation Experiments
Hardware Experiments
Conclusion
Method Details
Meta-Learning Dynamics Model Priors
Adapting the dynamics model to the target task
Model-based Control of PACOH-RL
The Model Predictive Control Variant
...and 12 more sections

Figures (10)

Figure 1: The PACOH-RL framework uses datasets of transitions $\mathcal{D}_1, ..., \mathcal{D}_n$ from previous RL tasks to meta-learn a BNN prior. Then, we equip our BNN dynamics model with the meta-learned prior. This significantly improves the sample efficiency of model-based RL on a new target task.
Figure 2: Returns on evaluation tasks averaged over five seeds. We compare PACOH-RL to its greedy counterpart, PACOH-RL (greedy), GrBAL (Nagabandi et al., 2019), GrBAL-2x, H-UCRL (Curi et al., 2020), and PETS-DS (Chua et al., 2018). In all environments, PACOH-RL consistently outperforms the baselines in terms of sample efficiency and average return.
Figure 3: Returns after learning on evaluation tasks with sparse rewards for 10 episodes over five different seeds. We compare PACOH-RL to its greedy counterpart, PACOH-RL (greedy), H-UCRL, and its greedy version PETS-DS. In all environments, optimistic planning outperforms its greedy counterpart, with PACOH-RL performing the best.
Figure 4: High-torque motor RC car used in the hardware experiments. As depicted on the right, we have two different tire profiles for both the front and rear wheels. We can also add up to 400 g of weight to the front of the car in a cylindrical box, circled in the image.
Figure 5: Trajectories of the RC car obtained under different dynamical settings. Starting at rest, we apply the same control sequence for 50 timesteps at 30 Hz. We repeat the experiment three times for each setting and plot the mean trajectory. The crosses along the trajectory correspond to the car's mean position at an interval of 10 timesteps. The ellipses correspond to the empirical standard deviation in the car's position. The first two digits in the legend labels denote the sets of wheels used in the front and rear, respectively, and the third digit denotes the added weight in hectograms.
...and 5 more figures
