Coreset-Based Task Selection for Sample-Efficient Meta-Reinforcement Learning
Donglin Zhan, Leonardo F. Toso, James Anderson
TL;DR
This paper tackles sample inefficiency in model-agnostic meta-RL by introducing a derivative-free, coreset-based task selection scheme that picks a weighted subset of tasks to maximize diversity in gradient space. It grounds the method in submodular optimization, provides an ergodic convergence analysis for non-concave task rewards, and shows sample-efficiency gains both in general MAML-RL and in the MAML-LQR control setting, where a gradient-dominance assumption yields a sample complexity reduction proportional to $O(\log(1/ε))$. The approach provably reduces the number of training samples required to reach an $ε$-near stationary solution, and numerical experiments on deep RL benchmarks and LQR variants corroborate this. Overall, task selection emerges as a key lever for scalable, fast-adapting meta-RL systems, with practical relevance to robotics and control.
Abstract
We study task selection to enhance sample efficiency in model-agnostic meta-reinforcement learning (MAML-RL). Traditional meta-RL typically assumes that all available tasks are equally important, which can lead to task redundancy when tasks share significant similarities. To address this, we propose a coreset-based task selection approach that selects a weighted subset of tasks based on how diverse they are in gradient space, prioritizing the most informative and diverse tasks. Such task selection reduces the number of samples needed to find an $ε$-close stationary solution by a factor of $O(1/ε)$. Consequently, it guarantees faster adaptation to unseen tasks while focusing training on the most relevant tasks. As a case study, we incorporate task selection into MAML-LQR (Toso et al., 2024b) and prove a sample complexity reduction proportional to $O(\log(1/ε))$ when the task-specific costs also satisfy gradient dominance. Our theoretical guarantees underscore task selection as a key component for scalable and sample-efficient meta-RL. We numerically validate this trend across multiple RL benchmark problems, illustrating the benefits of task selection beyond the LQR baseline.
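To make the idea of a weighted task subset chosen by diversity in gradient space concrete, the following is a minimal sketch and not the paper's algorithm. It assumes that one gradient estimate per task is already available as a row of an array, and it uses greedy facility-location maximization, a standard submodular coreset heuristic, as a stand-in for the selection rule. The function name `select_task_coreset` and its arguments are hypothetical.

```python
import numpy as np

def select_task_coreset(grads, k):
    """Greedy facility-location selection over per-task gradient estimates.

    Assumptions (not from the paper): `grads` is an (n_tasks, dim) array with
    one gradient estimate per task, and similarity is derived from Euclidean
    distance. Returns (selected indices, weights), where each weight counts
    the tasks whose most similar selected task is that index.
    """
    grads = np.asarray(grads, dtype=float)
    n = grads.shape[0]
    # Pairwise Euclidean distances between task gradients.
    dists = np.linalg.norm(grads[:, None, :] - grads[None, :, :], axis=-1)
    # Convert distances into non-negative similarities.
    sims = dists.max() - dists
    selected = []
    best = np.zeros(n)  # best similarity of each task to the current selection
    for _ in range(min(k, n)):
        # Marginal gain of candidate j: sum_i max(sims[i, j] - best[i], 0).
        gains = np.maximum(sims - best[:, None], 0.0).sum(axis=0)
        if selected:
            gains[selected] = -np.inf  # never pick the same task twice
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sims[:, j])
    # Assign every task to its most similar selected task and count.
    assign = np.argmax(sims[:, selected], axis=1)
    weights = np.bincount(assign, minlength=len(selected))
    return selected, weights
```

With two well-separated clusters of task gradients and `k = 2`, this sketch picks one representative per cluster and weights each by its cluster size, so that training on the weighted subset approximates training on all tasks while skipping redundant ones.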
