Direct Soft-Policy Sampling via Langevin Dynamics

Donghyeon Ki; Hee-Jun Ahn; Kyungyoon Kim; Byung-Jun Lee

TL;DR

This work tackles the challenge of implementing soft policies in online RL by proposing Langevin Q-Learning (LQL), which directly samples actions from the soft Boltzmann policy defined by the current Q-function using Langevin dynamics. To overcome slow mixing in rugged Q-landscapes, it introduces Noise-Conditioned Langevin Q-Learning (NC-LQL), which applies multi-scale noise to the value function to create progressively smoothed landscapes, enabling efficient exploration and later refinement. The approach remains actor-free and avoids entropy or density estimation, while achieving competitive results on MuJoCo benchmarks and demonstrating strong multimodal action coverage in a 2D bandit setting; NC-LQL also shows favorable model complexity and training time compared to diffusion-based baselines. Overall, the paper provides a simple yet effective alternative to diffusion-based policies, combining principled sampling with practical efficiency for online RL.
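The direct sampling idea can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a hypothetical analytic critic Q(a) = -0.5 * ||a - mode||^2 in place of a learned Q-network, a fixed step size, and a temperature named `temperature` (the figure captions call this parameter $w$). For this critic the Boltzmann policy exp(Q/w) is a Gaussian centred on `mode` with variance w, so the Langevin samples can be checked against a known answer.

```python
import numpy as np

def toy_q_grad(actions, mode):
    """Action gradient of the hypothetical critic Q(a) = -0.5 * ||a - mode||^2."""
    return -(actions - mode)

def langevin_sample(q_grad, init_actions, temperature, step_size, n_steps, rng):
    """Unadjusted Langevin dynamics targeting pi(a) proportional to exp(Q(a) / temperature)."""
    a = np.array(init_actions, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(a.shape)
        # Drift follows the score grad_a Q / temperature; the noise term keeps samples spread out.
        a = a + 0.5 * step_size * q_grad(a) / temperature + np.sqrt(step_size) * noise
    return a

rng = np.random.default_rng(0)
mode = np.array([1.0, -2.0])                       # assumed location of the best action
init = rng.uniform(-3.0, 3.0, size=(5000, 2))      # arbitrary initial actions
samples = langevin_sample(
    lambda a: toy_q_grad(a, mode), init,
    temperature=0.5, step_size=0.01, n_steps=2000, rng=rng,
)
print("sample mean:", samples.mean(axis=0))
print("sample variance:", samples.var(axis=0))
```

With a learned critic, `q_grad` would instead be obtained by differentiating the Q-network with respect to the action; no actor network or density estimate is involved in the sampling loop itself.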

Abstract

Soft policies in reinforcement learning define policies as Boltzmann distributions over state-action value functions, providing a principled mechanism for balancing exploration and exploitation. However, realizing such soft policies in practice remains challenging. Existing approaches either depend on parametric policies with limited expressivity or employ diffusion-based policies whose intractable likelihoods hinder reliable entropy estimation in soft policy objectives. We address this challenge by directly realizing soft-policy sampling via Langevin dynamics driven by the action gradient of the Q-function. This perspective leads to Langevin Q-Learning (LQL), which samples actions from the target Boltzmann distribution without explicitly parameterizing the policy. However, directly applying Langevin dynamics suffers from slow mixing in high-dimensional and non-convex Q-landscapes, limiting its practical effectiveness. To overcome this, we propose Noise-Conditioned Langevin Q-Learning (NC-LQL), which integrates multi-scale noise perturbations into the value function. NC-LQL learns a noise-conditioned Q-function that induces a sequence of progressively smoothed value landscapes, enabling sampling to transition from global exploration to precise mode refinement. On OpenAI Gym MuJoCo benchmarks, NC-LQL achieves competitive performance compared to state-of-the-art diffusion-based methods, providing a simple yet powerful solution for online RL.
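The coarse-to-fine behaviour described above can be sketched with standard annealed Langevin dynamics from score-based generative modeling. This is an illustration under stated assumptions, not NC-LQL itself: the target is a hypothetical one-dimensional bimodal Boltzmann density (a two-component Gaussian mixture), and the score at each noise scale is computed analytically from the Gaussian-smoothed mixture rather than from the action gradient of a learned noise-conditioned Q-function. The names `CENTERS`, `BASE_STD`, and the step-size rule (step proportional to the squared noise scale) are choices made for this sketch.

```python
import numpy as np

CENTERS = np.array([-2.0, 2.0])   # hypothetical action modes
BASE_STD = 0.3                    # hypothetical width of each mode

def smoothed_score(a, sigma):
    """Score of the mixture after convolving it with N(0, sigma^2)."""
    var = BASE_STD**2 + sigma**2                  # smoothing inflates each component
    diff = CENTERS[None, :] - a[:, None]          # shape (N, 2)
    logits = -0.5 * diff**2 / var
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)       # component responsibilities
    return (resp * diff).sum(axis=1) / var

def annealed_langevin(init, sigmas, base_step, steps_per_level, rng):
    """Run Langevin dynamics from the largest noise scale down to the smallest."""
    a = np.array(init, dtype=float)
    for sigma in sigmas:
        step = base_step * (sigma / sigmas[-1]) ** 2   # larger steps on smoother landscapes
        for _ in range(steps_per_level):
            noise = rng.standard_normal(a.shape)
            a = a + 0.5 * step * smoothed_score(a, sigma) + np.sqrt(step) * noise
    return a

rng = np.random.default_rng(0)
sigmas = np.geomspace(3.0, 0.05, 10)   # coarse-to-fine noise scales
init = np.full(4000, 2.0)              # every chain starts inside the right-hand mode
samples = annealed_langevin(init, sigmas, base_step=0.002, steps_per_level=100, rng=rng)
frac_right = (samples > 0).mean()
print("fraction in right mode:", frac_right)
```

Although every chain starts in one mode, the heavily smoothed early levels are nearly unimodal, so chains can cross between modes before the later, sharper levels refine them. In this toy setting both modes end up populated.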

Paper Structure (46 sections, 18 equations, 19 figures, 6 tables, 6 algorithms)

Introduction
Preliminaries
Reinforcement Learning (RL)
Langevin Dynamics
Score-Based Generative Modeling
Langevin Q-Learning
Langevin Q-Learning
Noise-Conditioned Langevin Q-Learning
Multi-Scale Noise Perturbation
Noise-Conditioned Langevin Q-Learning
TD-learning at $\sigma_L$
Value Smoothing
Relation to Previous Annealing Method
Implementation Details
Soft Policy Temperature
...and 31 more sections

Figures (19)

Figure 1: Comparison between standard actor-critic and Langevin Q-Learning (LQL). While standard actor-critic methods rely on explicit actor updates to approximate a target policy, LQL directly samples actions from the Boltzmann distribution defined by the Q-function via Langevin dynamics, removing the need for a separate actor update.
Figure 2: Visualization of value maps in a 2D bandit environment. We show the noise-conditioned Q-function $Q_{\text{NC}}(\mathbf{s}, \tilde{\mathbf{a}}, \sigma_i)$ at different noise scales ($i=1,4,7,10$), compared with the standard Bellman critic $Q(\mathbf{s}, \mathbf{a})$. Experimental details are provided in the appendix on the bandit environment.
Figure 3: Visualization of samples obtained across different temperature parameters $w$ in the 2D bandit environment. The background represents the ground-truth reward landscape. The detailed experimental setup is provided in the appendix on the bandit environment.
Figure 4: Visualization of samples obtained from different algorithms in the 2D bandit environment. White dots denote initial samples, and arrows indicate their corresponding denoised actions (yellow) sampled by each method. The numbers below each plot show the mean reward $\pm$ standard deviation, computed over 10k denoised samples. The detailed experimental setup is provided in the appendix on the bandit environment.
Figure 5: Training performance on OpenAI Gym MuJoCo environments. Each curve reports the mean return over 5 random seeds, with shaded regions indicating the standard error.
...and 14 more figures

Theorems & Definitions (2)

Definition 1: Langevin soft policy
Definition 2: Noise-conditioned Langevin soft policy
