Boosting Soft Q-Learning by Bounding
Jacob Adamczyk, Volodymyr Makarenko, Stas Tiomkin, Rahul V. Kulkarni
TL;DR
This work introduces a zero-shot framework for bounding the optimal soft Q-function $Q^*(s,a)$ using any bounded value estimate in entropy-regularized RL, yielding double-sided bounds that can be used to clip target values during training. Starting from the soft value function $V(s)=\frac{1}{\beta}\log \mathbb{E}_{a\sim\pi_0} e^{\beta Q(s,a)}$ and the residual $\Delta(s,a)=r(s,a)+\gamma \mathbb{E}_{s'}V(s')-Q(s,a)$, the authors derive bounds on $Q^*$, design a clipping mechanism that accelerates learning in both tabular and function-approximation settings, and prove convergence of the clipped Bellman operator to $Q^*$. The approach extends to continuous spaces via batch-extrema approximations and Lipschitz-based probabilistic bounds, and experiments show faster convergence and more robust training across diverse tasks. Overall, the bounds provide a principled way to reuse prior value information to tighten target estimates and improve data efficiency in value-based RL. Potential applications include tighter initializations, ensemble-based bound tightening, and integration with model-based or actor-critic methods to boost performance in real-world tasks.
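To make the clipping idea concrete, below is a minimal tabular sketch (not the authors' implementation): it computes $V$ and $\Delta$ from the definitions above for an arbitrary bounded Q estimate, forms a double-sided interval, and clips soft Bellman targets to it. The specific bound form used here, $Q+\Delta+\frac{\gamma}{1-\gamma}\,[\min\Delta,\max\Delta]$, and all function names are assumptions made for illustration rather than the paper's exact statements.

```python
import numpy as np

def soft_value(Q, beta, prior=None):
    """V(s) = (1/beta) log E_{a~pi_0} exp(beta Q(s,a)), the soft value from the TL;DR."""
    nS, nA = Q.shape
    if prior is None:
        prior = np.full((nS, nA), 1.0 / nA)        # uniform prior policy pi_0
    m = Q.max(axis=1)                              # stabilize the log-sum-exp
    return m + np.log((prior * np.exp(beta * (Q - m[:, None]))).sum(axis=1)) / beta

def residual(Q, R, P, gamma, beta, prior=None):
    """Delta(s,a) = r(s,a) + gamma E_{s'} V(s') - Q(s,a), the residual from the TL;DR.
    R: rewards [nS, nA]; P: transition kernel [nS, nA, nS]."""
    V = soft_value(Q, beta, prior)
    return R + gamma * (P @ V) - Q

def bounds_from_estimate(Q_est, R, P, gamma, beta, prior=None):
    """Double-sided bounds on Q* derived from any bounded estimate Q_est.
    Assumed form (illustration only): Q_est + Delta + gamma/(1-gamma) * [min Delta, max Delta]."""
    D = residual(Q_est, R, P, gamma, beta, prior)
    lower = Q_est + D + gamma / (1.0 - gamma) * D.min()
    upper = Q_est + D + gamma / (1.0 - gamma) * D.max()
    return lower, upper

def clipped_backup(Q, lower, upper, R, P, gamma, beta, prior=None):
    """One soft Bellman backup whose target is clipped to the [lower, upper] interval."""
    target = R + gamma * (P @ soft_value(Q, beta, prior))
    return np.clip(target, lower, upper)

# Usage: derive bounds once from a crude estimate (here Q = 0), then run clipped soft Q-iteration.
rng = np.random.default_rng(0)
nS, nA, gamma, beta = 5, 3, 0.9, 5.0
R = rng.uniform(-1.0, 1.0, size=(nS, nA))
P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # P[s, a] is a distribution over next states
lower, upper = bounds_from_estimate(np.zeros((nS, nA)), R, P, gamma, beta)
Q = np.zeros((nS, nA))
for _ in range(200):
    Q = clipped_backup(Q, lower, upper, R, P, gamma, beta)
```

In this sketch the bounds are computed once from a prior estimate and held fixed, while the current Q is updated with clipped backups; the same pattern applies when the targets come from a replay batch rather than a full model.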
Abstract
An agent's ability to leverage past experience is critical for efficiently solving new tasks. Prior work has focused on using value function estimates to obtain zero-shot approximations for solutions to a new task. In soft Q-learning, we show how any value function estimate can also be used to derive double-sided bounds on the optimal value function. The derived bounds lead to new approaches for boosting training performance, which we validate experimentally. Notably, we find that the proposed framework suggests an alternative method for updating the Q-function, leading to boosted performance.
