Isoperimetry is All We Need: Langevin Posterior Sampling for RL with Sublinear Regret

Emilio Jorge; Christos Dimitrakakis; Debabrota Basu

TL;DR

This work extends Bayesian RL theory beyond Gaussian and log-concave posteriors to distributions that satisfy the Log-Sobolev Inequality (LSI). It proves sublinear Bayesian regret for PSRL under LSI and introduces LaPSRL, a Langevin-sampling-based extension for approximate posteriors that achieves order-optimal regret with subquadratic per-episode complexity. The authors provide gradient-complexity bounds via SARAH-LD and establish sublinear regret for Gaussian, mixture, and continuous MDP settings, supported by experiments on Gaussian bandits, Cartpole, and Reacher. Treating PSRL and Langevin-based posterior sampling within a single LSI (isoperimetry) framework extends the analysis to richer posterior models and yields practical algorithms with theoretical guarantees. The results may benefit RL in complex, non-log-concave settings and can guide future work on deep RL with isoperimetric priors.

Abstract

Common assumptions, like linear or RKHS models, and Gaussian or log-concave posteriors over the models, do not explain the practical success of RL across a wider range of distributions and models. Thus, we study how to design RL algorithms with sublinear regret for isoperimetric distributions, specifically the ones satisfying the Log-Sobolev Inequality (LSI). LSI distributions include the standard setups of RL theory, as well as others, such as many non-log-concave and perturbed distributions. First, we show that the Posterior Sampling-based RL (PSRL) algorithm yields sublinear regret if the data distributions satisfy LSI and some mild additional assumptions. Second, for the case where we cannot compute or sample from an exact posterior, we propose a Langevin sampling-based algorithm design: LaPSRL. We show that LaPSRL achieves order-optimal regret and subquadratic complexity per episode. Finally, we deploy LaPSRL with a Langevin sampler, SARAH-LD, and test it on different bandit and MDP environments. Experimental results validate the generality of LaPSRL across environments and its competitive performance with respect to the baselines.

Paper Structure (20 sections, 26 theorems, 38 equations, 3 figures, 3 tables, 4 algorithms)

Introduction
Problem Setup & Background
PSRL for Exact posteriors
LaPSRL for Approximate Posteriors
Distributions with Linear LSI Constants
Experimental Analysis
Extended Related Works
Discussion & Future Works
Notation
Algorithmic Details: SARAH-LD
SARAH-LD
Regret Bounds for PSRL with Exact Posteriors
Confidence Intervals for Isoperimetric Data Distributions
Regret for Posteriors with Linear LSI Constants
Regret Bounds and Sample Complexity for LaPSRL with Approximate Posteriors
...and 5 more sections

Key Result

Theorem 1: Bakry-Émery criterion

If a distribution $\nu$ satisfies $-\nabla_{\theta}^2 \log \nu \succeq \alpha I_d$ for some $\alpha > 0$, where $\succeq$ denotes the Loewner order, $I_d$ is the identity matrix of dimension $d$, and $\theta$ is the parametrization of $\nu$, then $\nu$ satisfies the LSI with constant $\alpha$.
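
As a minimal illustration of how the criterion can be checked, consider a two-dimensional Gaussian $N(\mu, \Sigma)$. The negative Hessian of its log-density is the precision matrix $\Sigma^{-1}$, so the largest admissible $\alpha$ is the smallest eigenvalue of $\Sigma^{-1}$. The sketch below is not taken from the paper: the function names `min_eig_sym2` and `gaussian_lsi_constant` are hypothetical, the Hessian is taken with respect to the argument of the density, and only the 2x2 case is handled.

```python
import math


def min_eig_sym2(a, b, c):
    """Smallest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius


def gaussian_lsi_constant(cov):
    """Largest alpha with -Hessian(log density) >= alpha * I for N(mu, cov).

    `cov` is a symmetric positive-definite 2x2 covariance given as nested
    lists. This helper is illustrative only (an assumption of this sketch).
    """
    (s11, s12), (s21, s22) = cov
    if abs(s12 - s21) > 1e-12:
        raise ValueError("covariance must be symmetric")
    det = s11 * s22 - s12 * s21
    if s11 <= 0 or det <= 0:
        raise ValueError("covariance must be positive definite")
    # Precision matrix = inverse covariance = negative Hessian of log density.
    p11, p12, p22 = s22 / det, -s12 / det, s11 / det
    return min_eig_sym2(p11, p12, p22)


# Variances 2.0 and 0.5 give precisions 0.5 and 2.0; alpha is the smaller one.
alpha = gaussian_lsi_constant([[2.0, 0.0], [0.0, 0.5]])
print(alpha)
```

For a Gaussian this reduces to $\alpha = 1/\lambda_{\max}(\Sigma)$: the direction with the largest variance determines the constant.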

Figures (3)

Figure 1: Examples of log-Sobolev distributions.
Figure 2: We compare LaPSRL against baselines. In the bandit and Cartpole experiments, we benchmark against PSRL, and in Reacher against TD3 and PPO. For the Gaussian bandits, we compare the expected regret, and for Cartpole we evaluate how many episodes it takes to solve the task. Finally, in Reacher, we study the average regret per episode. In all environments, we average over 50 runs with the standard error highlighted around the average. Larger plots are in the appendix (fig:app_plots).
Figure 3: We compare LaPSRL against baselines. In the bandit and Cartpole experiments, we benchmark against PSRL, and in Reacher against TD3. In (a) we compare the expected regret for a Gaussian bandit. In (b) we compare how many episodes it takes to solve a Cartpole task. In (c) we study the average regret per episode in the Reacher environment. In all environments, we average over 50 independent runs with the standard error highlighted around the average.

Theorems & Definitions (39)

Definition 1: log-Sobolev inequality
Theorem 1: Bakry-Émery criterion
Theorem 2: steinergao2021feynmankac
Theorem 3
Remark 1: Prior Design
Remark 2: Lipschitz Log-likelihood
Lemma 1
Theorem 4
Theorem 5
Corollary 1
...and 29 more
