Direct Soft-Policy Sampling via Langevin Dynamics
Donghyeon Ki, Hee-Jun Ahn, Kyungyoon Kim, Byung-Jun Lee
TL;DR
This work tackles the challenge of implementing soft policies in online RL by proposing Langevin Q-Learning (LQL), which directly samples actions from the soft Boltzmann policy defined by the current Q-function using Langevin dynamics. To overcome slow mixing in rugged Q-landscapes, it introduces Noise-Conditioned Langevin Q-Learning (NC-LQL), which applies multi-scale noise to the value function to create progressively smoothed landscapes, enabling efficient exploration and later refinement. The approach remains actor-free and avoids entropy or density estimation, while achieving competitive results on MuJoCo benchmarks and demonstrating strong multimodal action coverage in a 2D bandit setting; NC-LQL also shows favorable model complexity and training time compared to diffusion-based baselines. Overall, the paper provides a simple yet effective alternative to diffusion-based policies, combining principled sampling with practical efficiency for online RL.
Abstract
In reinforcement learning, soft policies are Boltzmann distributions over state-action value functions, providing a principled mechanism for balancing exploration and exploitation. However, realizing such soft policies in practice remains challenging. Existing approaches either depend on parametric policies with limited expressivity or employ diffusion-based policies whose intractable likelihoods hinder reliable entropy estimation in soft policy objectives. We address this challenge by directly realizing soft-policy sampling via Langevin dynamics driven by the action gradient of the Q-function. This perspective leads to Langevin Q-Learning (LQL), which samples actions from the target Boltzmann distribution without explicitly parameterizing the policy. Applied directly, however, Langevin dynamics mixes slowly in high-dimensional and non-convex Q-landscapes, limiting its practical effectiveness. To overcome this, we propose Noise-Conditioned Langevin Q-Learning (NC-LQL), which integrates multi-scale noise perturbations into the value function. NC-LQL learns a noise-conditioned Q-function that induces a sequence of progressively smoothed value landscapes, enabling sampling to transition from global exploration to precise mode refinement. On OpenAI Gym MuJoCo benchmarks, NC-LQL achieves competitive performance compared to state-of-the-art diffusion-based methods, providing a simple yet powerful solution for online RL.
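The core sampling idea can be illustrated with a minimal sketch; it is not the authors' implementation. It runs unadjusted Langevin dynamics on the action, using the action gradient of Q as the drift, so that the samples approximately follow a Boltzmann distribution proportional to exp(Q(s, a) / alpha). Everything named here is an assumption made for illustration: the temperature `alpha`, the step size `step`, the helper `langevin_sample`, and the toy quadratic Q (state omitted), chosen because its Boltzmann distribution is a Gaussian with mean `MU` and variance `alpha`, which makes the result easy to check. In practice the gradient would come from automatic differentiation of a learned Q-network.

```python
import numpy as np

def langevin_sample(grad_q, a0, alpha=0.5, step=0.01, n_steps=500, rng=None):
    """Unadjusted Langevin dynamics targeting p(a) proportional to exp(Q(a) / alpha).

    grad_q : callable returning dQ/da for a batch of actions (hypothetical helper).
    a0     : initial actions, shape (batch, action_dim).
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.array(a0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(a.shape)
        # Drift along the score grad_a Q / alpha, plus Gaussian diffusion.
        a = a + step * grad_q(a) / alpha + np.sqrt(2.0 * step) * noise
    return a

# Toy Q(a) = -0.5 * ||a - MU||^2 (assumed for illustration only).
# Its Boltzmann distribution at temperature alpha is N(MU, alpha * I).
MU = np.array([1.0, -2.0])

def toy_grad_q(a):
    return -(a - MU)

rng = np.random.default_rng(0)
samples = langevin_sample(toy_grad_q, np.zeros((4000, 2)), rng=rng)
print("sample mean:", samples.mean(axis=0))
print("sample variance:", samples.var(axis=0))
```

With the default `alpha=0.5`, the sample mean should land close to `MU` and the per-dimension variance close to 0.5, up to a small discretization bias from the finite step size. On this unimodal toy problem the chain mixes quickly; the slow mixing described in the abstract arises for multimodal, non-convex Q-landscapes, which is the case the noise-conditioned variant is meant to address.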
