Isoperimetry is All We Need: Langevin Posterior Sampling for RL with Sublinear Regret
Emilio Jorge, Christos Dimitrakakis, Debabrota Basu
TL;DR
This work extends Bayesian RL theory beyond Gaussian or log-concave posteriors to distributions that satisfy the Log-Sobolev Inequality (LSI). It proves sublinear Bayesian regret for PSRL under LSI and introduces LaPSRL, a Langevin-sampling-based extension for approximate posteriors that achieves order-optimal regret with subquadratic per-episode complexity. The authors provide gradient-complexity bounds via SARAH-LD and show sublinear regret for Gaussian, mixture, and continuous MDP settings, supported by experiments on Gaussian bandits, Cartpole, and Reacher. By treating PSRL and Langevin-based posterior sampling within a single LSI/isoperimetry framework, the paper covers richer posterior models and pairs practical algorithms with theoretical guarantees. The results are relevant to RL in complex, non-log-concave settings and may guide future work on deep RL with isoperimetric priors.
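For reference, one common way to state the Log-Sobolev Inequality is given below. The symbols ν, α, g, and ρ are notation assumed here for illustration, and the paper's own normalization of the constant may differ. Under this convention, α-strongly log-concave distributions satisfy LSI with constant α (Bakry-Émery), and bounded perturbations of an LSI distribution still satisfy LSI with a degraded constant (Holley-Stroock), which is how non-log-concave posteriors can be covered.

```latex
% A distribution \nu on \mathbb{R}^d satisfies LSI with constant \alpha > 0
% if, for every smooth g with \mathbb{E}_\nu[g^2] < \infty,
\[
  \mathbb{E}_\nu\!\left[g^2 \log g^2\right]
  - \mathbb{E}_\nu\!\left[g^2\right] \log \mathbb{E}_\nu\!\left[g^2\right]
  \;\le\; \frac{2}{\alpha}\, \mathbb{E}_\nu\!\left[\lVert \nabla g \rVert^2\right].
\]
% Equivalent form, obtained by setting g^2 = \rho / \nu: relative entropy is
% controlled by relative Fisher information for every \rho \ll \nu,
\[
  \mathrm{KL}(\rho \,\Vert\, \nu)
  \;\le\; \frac{1}{2\alpha}
  \int \rho(x)\, \Bigl\lVert \nabla \log \frac{\rho(x)}{\nu(x)} \Bigr\rVert^2 \, dx .
\]
```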
Abstract
Common assumptions, like linear or RKHS models, and Gaussian or log-concave posteriors over the models, do not explain the practical success of RL across a wider range of distributions and models. Thus, we study how to design RL algorithms with sublinear regret for isoperimetric distributions, specifically the ones satisfying the Log-Sobolev Inequality (LSI). LSI distributions include the standard setups of RL theory as well as others, such as many non-log-concave and perturbed distributions. First, we show that the Posterior Sampling-based RL (PSRL) algorithm yields sublinear regret if the data distributions satisfy LSI and some mild additional assumptions. Second, for cases where we cannot compute or sample from an exact posterior, we propose a Langevin sampling-based algorithm design: LaPSRL. We show that LaPSRL achieves order-optimal regret and subquadratic complexity per episode. Finally, we deploy LaPSRL with the Langevin sampler SARAH-LD and test it on different bandit and MDP environments. Experimental results validate the generality of LaPSRL across environments and its competitive performance with respect to the baselines.
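As a rough illustration of posterior sampling driven by a Langevin sampler, the sketch below runs Thompson-style posterior sampling on a Gaussian bandit, drawing each approximate posterior sample with Langevin dynamics. It is not the authors' LaPSRL implementation. It uses the plain unadjusted Langevin algorithm (ULA) in place of SARAH-LD and assumes a conjugate Gaussian model so that the gradient has a closed form. The function names, step-size rule, and number of steps are hypothetical choices made only for this example.

```python
import numpy as np

PRIOR_VAR = 1.0   # assumed N(0, PRIOR_VAR) prior on each arm mean
NOISE_VAR = 1.0   # assumed known reward noise variance


def grad_log_posterior(theta, counts, sums):
    """Gradient of the log-posterior over arm means (Gaussian prior and likelihood)."""
    return -theta / PRIOR_VAR + (sums - counts * theta) / NOISE_VAR


def langevin_sample(theta0, counts, sums, rng, n_steps=200):
    """Approximate posterior sample via the unadjusted Langevin algorithm (ULA)."""
    # Per-arm step size proportional to the inverse posterior curvature
    # (a heuristic that keeps the discretized dynamics stable).
    curvature = 1.0 / PRIOR_VAR + counts / NOISE_VAR
    step = 0.1 / curvature
    theta = theta0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta + step * grad_log_posterior(theta, counts, sums) \
            + np.sqrt(2.0 * step) * noise
    return theta


def run_bandit(true_means, horizon, seed=0):
    """Posterior sampling on a Gaussian bandit; returns (cumulative regret, pull counts)."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)   # number of pulls per arm
    sums = np.zeros(k)     # sum of observed rewards per arm
    theta = np.zeros(k)    # warm start for the sampler across rounds
    regret = 0.0
    for _ in range(horizon):
        theta = langevin_sample(theta, counts, sums, rng)
        arm = int(np.argmax(theta))  # act greedily w.r.t. the sampled model
        reward = true_means[arm] + np.sqrt(NOISE_VAR) * rng.standard_normal()
        counts[arm] += 1
        sums[arm] += reward
        regret += true_means.max() - true_means[arm]
    return regret, counts


if __name__ == "__main__":
    regret, counts = run_bandit(np.array([0.0, 0.5, 2.0]), horizon=300)
    print("cumulative regret:", regret)
    print("pulls per arm:", counts)
```

The sampler is warm-started from the previous round's sample, so each round begins near the previous posterior rather than from scratch. A variance-reduced sampler such as SARAH-LD would replace the full gradient in `langevin_sample` with a cheaper stochastic estimate.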
