Prior-Aligned Meta-RL: Thompson Sampling with Learned Priors and Guarantees in Finite-Horizon MDPs
Runlin Zhou, Chixiang Chen, Elynn Chen
TL;DR
This work introduces a meta-reinforcement-learning framework for finite-horizon MDPs that uses a linear representation of $Q^*$ with a shared Gaussian prior over task-specific parameters. Two Thompson-sampling-based algorithms, MTSRL and MTSRL+, learn and exploit this prior, and a novel prior-alignment technique provides meta-regret guarantees relative to a meta-oracle. The theoretical results show favorable meta-regret scaling: with known covariance, the bound is $\tilde{O}(H^{3} S^{3/2} \sqrt{AN}\, K)$ initially and switches to $\tilde{O}(H^{4} S^{3/2} \sqrt{ANK})$ for large $K$, with similarly improved bounds when the covariance is learned. Empirical results on a stateful recommendation task show rapid adaptation to the meta-prior and substantial gains over prior-independent RL and bandit-based baselines, supporting both the practicality and the robustness of prior-aligned meta-RL with learned $Q^*$-priors.
Abstract
We study meta-reinforcement learning in finite-horizon MDPs where related tasks share similar structures in their optimal action-value functions. Specifically, we posit a linear representation $Q^*_h(s,a)=Φ_h(s,a)\,θ^{(k)}_h$ and place a Gaussian meta-prior $\mathcal{N}(θ^*_h,Σ^*_h)$ over the task-specific parameters $θ^{(k)}_h$. Building on randomized value functions, we propose two Thompson-style algorithms: (i) MTSRL, which learns only the prior mean and performs posterior sampling with the learned mean and known covariance; and (ii) $\text{MTSRL}^{+}$, which additionally estimates the covariance and employs prior widening to control finite-sample estimation error. Further, we develop a prior-alignment technique that couples the posterior under the learned prior with a meta-oracle that knows the true prior, yielding meta-regret guarantees: we match prior-independent Thompson sampling in the small-task regime and strictly improve with more tasks once the prior is learned. Concretely, for known covariance we obtain $\tilde{O}(H^{4}S^{3/2}\sqrt{ANK})$ meta-regret, and with learned covariance $\tilde{O}(H^{4}S^{3/2}\sqrt{AN^3K})$; these bounds improve on prior-independent Thompson sampling once $K \gtrsim \tilde{O}(H^2)$ and $K \gtrsim \tilde{O}(N^2H^2)$, respectively. Simulations on a stateful recommendation environment (with feature and prior misspecification) show that after brief exploration, MTSRL and $\text{MTSRL}^{+}$ track the meta-oracle and substantially outperform prior-independent RL and bandit-only meta-baselines. Our results give the first meta-regret guarantees for Thompson-style RL with learned Q-priors, and provide practical recipes (warm-start via RLSVI, OLS aggregation, covariance widening) for experiment-rich settings.
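
To make the recipe named in the abstract (OLS aggregation, covariance widening, posterior sampling under a learned Gaussian prior) more concrete, the following is a minimal sketch for a single stage $h$, written as generic Bayesian linear regression. It is an illustration under stated assumptions and not the paper's algorithm: the function names (`ols`, `estimate_prior`, `gaussian_posterior`, `thompson_sample`, `demo`), the additive widening `widen * I`, the plain sample covariance, the known noise variance `noise_var`, and the synthetic regression targets that stand in for value targets are all hypothetical choices made for this example.

```python
import numpy as np


def ols(Phi, y):
    """Least-squares estimate of theta from y ~ Phi @ theta."""
    return np.linalg.lstsq(Phi, y, rcond=None)[0]


def estimate_prior(theta_hats, widen=0.0):
    """Aggregate per-task estimates into a prior mean and covariance.

    Assumption: the covariance is the plain sample covariance of the
    per-task estimates, widened by adding widen * I. The paper's exact
    estimator and widening rule are not given in the abstract.
    """
    theta_hats = np.asarray(theta_hats, dtype=float)
    d = theta_hats.shape[1]
    mean = theta_hats.mean(axis=0)
    cov = np.atleast_2d(np.cov(theta_hats, rowvar=False)) + widen * np.eye(d)
    return mean, cov


def gaussian_posterior(Phi, y, prior_mean, prior_cov, noise_var=1.0):
    """Posterior of theta for y = Phi @ theta + N(0, noise_var) noise,
    with prior theta ~ N(prior_mean, prior_cov)."""
    prior_prec = np.linalg.inv(prior_cov)
    post_prec = prior_prec + Phi.T @ Phi / noise_var
    post_cov = np.linalg.inv(post_prec)
    post_cov = (post_cov + post_cov.T) / 2.0  # remove round-off asymmetry
    post_mean = post_cov @ (prior_prec @ prior_mean + Phi.T @ y / noise_var)
    return post_mean, post_cov


def thompson_sample(rng, post_mean, post_cov):
    """Draw one parameter vector from the Gaussian posterior."""
    return rng.multivariate_normal(post_mean, post_cov)


def demo(seed=0, num_tasks=30, n_obs=40, d=3, noise_var=0.25):
    """Learn a prior from earlier tasks, then sample for a new task."""
    rng = np.random.default_rng(seed)
    true_mean = rng.normal(size=d)      # hypothetical theta*
    true_cov = 0.1 * np.eye(d)          # hypothetical Sigma*
    theta_hats = []
    for _ in range(num_tasks):
        theta_k = rng.multivariate_normal(true_mean, true_cov)
        Phi = rng.normal(size=(n_obs, d))
        y = Phi @ theta_k + rng.normal(scale=np.sqrt(noise_var), size=n_obs)
        theta_hats.append(ols(Phi, y))
    prior_mean, prior_cov = estimate_prior(theta_hats, widen=0.05)

    # New task with only a handful of observations: the learned prior
    # dominates the posterior until more data arrive.
    theta_new = rng.multivariate_normal(true_mean, true_cov)
    Phi_new = rng.normal(size=(5, d))
    y_new = Phi_new @ theta_new + rng.normal(scale=np.sqrt(noise_var), size=5)
    post_mean, post_cov = gaussian_posterior(
        Phi_new, y_new, prior_mean, prior_cov, noise_var
    )
    return thompson_sample(rng, post_mean, post_cov)


if __name__ == "__main__":
    print(demo())
```

In this sketch, passing the true covariance to `gaussian_posterior` instead of the estimated one loosely corresponds to the known-covariance setting described for MTSRL, while using the widened estimate loosely corresponds to the $\text{MTSRL}^{+}$ setting. The sample covariance of OLS estimates also absorbs per-task estimation noise, which is one reason a real implementation would likely need a more careful estimator than the one shown here.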
