Data-Efficient Task Generalization via Probabilistic Model-based Meta Reinforcement Learning
Arjun Bhardwaj, Jonas Rothfuss, Bhavya Sukhija, Yarden As, Marco Hutter, Stelian Coros, Andreas Krause
TL;DR
The paper tackles data-efficient task generalization in reinforcement learning by introducing PACOH-RL, which meta-learns a Bayesian neural network prior over dynamics using PACOH-NN and Stein Variational Gradient Descent (SVGD). On a new target task, it combines the prior with the limited observed data to form a posterior over dynamics and uses uncertainty-aware optimistic exploration (H-UCRL) to guide data collection, enabling rapid adaptation. PACOH-RL supports both an iCEM-MPC planning variant and a SAC-based policy variant, with a Greedy baseline for ablations, and demonstrates superior sample efficiency and transfer to a real RC car under sparse rewards. The results show improvements over model-based RL baselines and model-free approaches, highlighting the practical potential of principled uncertainty quantification for fast, data-efficient policy adaptation in robotics.
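To make the SVGD ingredient concrete, the following is a minimal sketch of a generic SVGD particle update, not the authors' implementation. In PACOH-RL the particles would represent neural-network quantities; here a toy 2-D Gaussian target stands in for that distribution. The function names (`rbf_kernel`, `svgd_step`, `run_svgd`), the fixed kernel bandwidth, and the step size are all assumptions chosen for illustration.

```python
import numpy as np


def rbf_kernel(x, bandwidth):
    """RBF kernel matrix and its gradient for particles x of shape (n, d)."""
    diff = x[:, None, :] - x[None, :, :]          # diff[j, i] = x_j - x_i
    sq_dist = np.sum(diff ** 2, axis=-1)          # (n, n)
    K = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    # Gradient of k(x_j, x_i) with respect to x_j, shape (n, n, d).
    grad_K = -diff / bandwidth ** 2 * K[:, :, None]
    return K, grad_K


def svgd_step(x, score_fn, step_size=0.1, bandwidth=1.0):
    """One SVGD update: kernel-weighted attraction plus particle repulsion."""
    n = x.shape[0]
    K, grad_K = rbf_kernel(x, bandwidth)
    scores = score_fn(x)                          # grad of log density, (n, d)
    # phi_i = 1/n * sum_j [ k(x_j, x_i) * score(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (K.T @ scores + grad_K.sum(axis=0)) / n
    return x + step_size * phi


def run_svgd(x0, score_fn, num_steps=1000, **kwargs):
    """Apply svgd_step repeatedly, starting from initial particles x0."""
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        x = svgd_step(x, score_fn, **kwargs)
    return x


# Toy target (an assumption for this sketch): unit-variance Gaussian at mu.
mu = np.array([2.0, -1.0])
gaussian_score = lambda p: -(p - mu)

rng = np.random.default_rng(0)
particles = run_svgd(rng.normal(size=(30, 2)), gaussian_score)
print("particle mean:", particles.mean(axis=0))
```

The first term in `phi` pulls particles toward high-density regions, while the kernel-gradient term pushes them apart, so the set keeps covering the target instead of collapsing to a single point estimate. That spread is what later serves as an epistemic uncertainty estimate.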
Abstract
We introduce PACOH-RL, a novel model-based Meta-Reinforcement Learning (Meta-RL) algorithm designed to efficiently adapt control policies to changing dynamics. PACOH-RL meta-learns priors for the dynamics model, allowing swift adaptation to new dynamics with minimal interaction data. Existing Meta-RL methods require abundant meta-learning data, limiting their applicability in settings such as robotics, where data is costly to obtain. To address this, PACOH-RL incorporates regularization and epistemic uncertainty quantification in both the meta-learning and task adaptation stages. When facing new dynamics, we use these uncertainty estimates to effectively guide exploration and data collection. Overall, this enables positive transfer, even when access to data from prior tasks or dynamic settings is severely limited. Our experimental results demonstrate that PACOH-RL outperforms model-based RL and model-based Meta-RL baselines in adapting to new dynamic conditions. Finally, on a real robotic car, we showcase the potential for efficient RL policy adaptation in diverse, data-scarce conditions.
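One way to read "use these uncertainty estimates to guide exploration" is the optimistic, hallucinated-control scheme of H-UCRL named in the TL;DR. There, the planner may place the next state anywhere inside the model's confidence region, s' = mean + beta * std * eta with eta in [-1, 1]. The sketch below illustrates that idea under strong simplifying assumptions: a small ensemble of linear models stands in for the Bayesian neural network particles, and the names `ensemble_predict`, `optimistic_next_state`, `eta_towards_goal`, and the scaling `beta` are hypothetical.

```python
import numpy as np


def ensemble_predict(weights, state, action):
    """Mean and epistemic std of next-state predictions across an ensemble.

    weights has shape (n_models, state_dim, state_dim + action_dim); each
    slice is a linear stand-in for one sampled dynamics model.
    """
    x = np.concatenate([state, action])
    preds = weights @ x                     # (n_models, state_dim)
    return preds.mean(axis=0), preds.std(axis=0)


def optimistic_next_state(mean, std, eta, beta=1.0):
    """Hallucinated transition: pick a point inside the confidence region."""
    eta = np.clip(eta, -1.0, 1.0)           # eta is restricted to [-1, 1]
    return mean + beta * std * eta


def eta_towards_goal(goal, mean, std, beta=1.0, eps=1e-8):
    """Choice of eta that moves the hallucinated state closest to a goal."""
    return np.clip((goal - mean) / (beta * std + eps), -1.0, 1.0)


# Two toy models that disagree only on how the action affects coordinate 0.
weights = np.array([
    [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.3], [0.0, 1.0, 0.0]],
])
state, action, goal = np.zeros(2), np.array([1.0]), np.array([1.0, 0.0])

mean, std = ensemble_predict(weights, state, action)
eta = eta_towards_goal(goal, mean, std)
print("optimistic next state:", optimistic_next_state(mean, std, eta))
```

Setting `eta` to zero recovers planning with the mean prediction only, which is the natural non-optimistic comparison. In a full planner, `eta` would be optimized jointly with the action sequence rather than computed in closed form as in this toy case.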
