Test-Time Regret Minimization in Meta Reinforcement Learning
Mirco Mutti, Aviv Tamar
TL;DR
This work analyzes test-time regret minimization in meta reinforcement learning with a finite set of MDPs under a perfect training assumption. It proves a fundamental lower bound under a separation condition, showing that regret must scale at least as $\Omega\left( \frac{T M \log(H)}{\lambda} \right)$, and that prior approaches achieving $O(M^2 \log(MH))$ regret are nearly optimal within this regime. To overcome the inherent linear-in-$M$ barrier, the paper introduces strong identifiability and presents three concrete structural regimes (clustering, tree structure, and revealing policies) that enable fast $\log(H)$-type rates together with sublinear dependence on $M$ (up to polylog factors). These results clarify when meta-RL can outperform standard RL at test time, provide a blueprint for algorithm design in structured multitask environments, and outline several directions for extending the analysis to broader, possibly infinite, task sets.
Abstract
Meta reinforcement learning sets a distribution over a set of tasks on which the agent can train at will; the agent is then asked to learn an optimal policy for any test task efficiently. In this paper, we consider a finite set of tasks modeled as Markov decision processes with different dynamics. We assume that a long training phase has already taken place, from which the set of tasks is perfectly recovered, and we focus on regret minimization against the optimal policy in the unknown test task. Under a separation condition that states the existence of a state-action pair that distinguishes one task from another, Chen et al. (2022) show that $O(M^2 \log(H))$ regret can be achieved, where $M$ and $H$ are the number of tasks in the set and the number of test episodes, respectively. In our first contribution, we demonstrate that the latter rate is nearly optimal by developing a novel lower bound for test-time regret minimization under separation, showing that a linear dependence on $M$ is unavoidable. Then, we present a family of stronger yet reasonable assumptions beyond separation, which we call strong identifiability, enabling algorithms that achieve fast $\log(H)$ rates and sublinear dependence on $M$ simultaneously. Our paper provides a new understanding of the statistical barriers of test-time regret minimization and of when fast rates can be achieved.
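
To give some intuition for why separation makes the test task identifiable, and why the cost of doing so naturally grows with $M$, the toy sketch below runs a sequential pairwise elimination over a known task set. It is not the algorithm of this paper or of Chen et al. (2022). It assumes, for illustration only, that separation is measured in total variation distance, that every task is a table of next-state distributions over the same state-action pairs, and that we may sample any state-action pair of the test task directly (a generative-model simplification that the regret setting does not grant). All names (`tv_distance`, `most_separating_pair`, `identify_task`, `TASKS`, `make_sampler`) are hypothetical.

```python
import math
import random


def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def most_separating_pair(task_i, task_j):
    """State-action pair where the two candidate tasks differ the most."""
    return max(task_i, key=lambda sa: tv_distance(task_i[sa], task_j[sa]))


def log_likelihood(dist, samples, eps=1e-12):
    """Log-likelihood of observed next states under a candidate distribution."""
    return sum(math.log(max(dist[x], eps)) for x in samples)


def identify_task(tasks, sample_next_state, n_samples=200):
    """Sequential pairwise elimination: M - 1 comparisons for M candidates.

    `sample_next_state(sa)` is an assumed oracle that draws one next state
    from the unknown test task at state-action pair `sa`.
    """
    best = 0
    for challenger in range(1, len(tasks)):
        sa = most_separating_pair(tasks[best], tasks[challenger])
        samples = [sample_next_state(sa) for _ in range(n_samples)]
        # Keep whichever candidate explains the fresh samples better.
        if log_likelihood(tasks[challenger][sa], samples) > log_likelihood(
            tasks[best][sa], samples
        ):
            best = challenger
    return best


# Hypothetical task set: three tasks, two state-action pairs, two next states.
TASKS = [
    {("s0", "a0"): [0.9, 0.1], ("s0", "a1"): [0.5, 0.5]},
    {("s0", "a0"): [0.1, 0.9], ("s0", "a1"): [0.5, 0.5]},
    {("s0", "a0"): [0.9, 0.1], ("s0", "a1"): [0.1, 0.9]},
]


def make_sampler(task, rng):
    """Build the sampling oracle for a given (hidden) test task."""
    def sample_next_state(sa):
        return rng.choices(range(len(task[sa])), weights=task[sa])[0]
    return sample_next_state


if __name__ == "__main__":
    rng = random.Random(0)
    print(identify_task(TASKS, make_sampler(TASKS[2], rng)))
```

In this sketch the sample budget is $(M-1)$ comparisons times `n_samples`, so it grows linearly with the number of candidate tasks. This only mirrors, and does not prove, the linear dependence on $M$ discussed above; the structural assumptions grouped under strong identifiability are what the paper uses to avoid comparing candidates one pair at a time.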
