Stochastic Decision Horizons for Constrained Reinforcement Learning
Nikola Milosevic, Leonard Franz, Daniel Haeufle, Georg Martius, Nico Scherf, Pavel Kolev
TL;DR
This paper introduces stochastic decision horizons (SDH), a horizon-shaping framework within Control as Inference that addresses constrained RL without relying on additive costs or Lagrange multipliers. By attaching a state-action-dependent continuation probability $\alpha(s,a)$, SDH produces shaped rewards $\tilde{r}=\alpha r$ and shaped discounts $\tilde{\gamma}=\gamma\alpha$, yielding a survival-weighted objective that remains replay-compatible for off-policy learning. It formalizes two continuation semantics, absorbing state (AS) and virtual termination (VT), which share the same survival-weighted return but interact differently with KL regularization, leading to the SAC-style AS-SAC and MPO-style VT-MPO algorithms. Empirically, SDH improves sample efficiency and reward-violation trade-offs on Safety Gymnasium and Hyfydy, and VT-MPO scales to high-dimensional musculoskeletal control without explicit dual optimization. The work lays a foundation for scalable, constraint-aware RL with replay and suggests future directions in adaptive continuation and risk-aware extensions.
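As a rough illustration of the shaping rule, the sketch below applies $\tilde{r}=\alpha r$ and $\tilde{\gamma}=\gamma\alpha$ to a short trajectory. The function names `td_target` and `survival_weighted_return` are hypothetical, and the backward recursion $G_t = \tilde{r}_t + \tilde{\gamma}_t G_{t+1}$ is an assumption about how the shaped quantities compose into a return; it is not taken from the paper's implementation.

```python
def td_target(reward, alpha, gamma, next_value):
    """One-step target using the shaped reward (alpha * r) and the
    shaped discount (gamma * alpha). Hypothetical helper for illustration."""
    shaped_reward = alpha * reward
    shaped_discount = gamma * alpha
    return shaped_reward + shaped_discount * next_value


def survival_weighted_return(rewards, alphas, gamma):
    """Unroll the one-step target backward over a finite trajectory,
    assuming a value of zero after the last step."""
    g = 0.0
    for r, a in zip(reversed(rewards), reversed(alphas)):
        g = td_target(r, a, gamma, g)
    return g


rewards = [1.0, 1.0, 1.0]
gamma = 0.9

# No violations: alpha = 1 everywhere recovers the ordinary discounted return.
print(survival_weighted_return(rewards, [1.0, 1.0, 1.0], gamma))

# A likely violation at the middle step (alpha = 0.5) attenuates that step's
# reward and everything that follows it.
print(survival_weighted_return(rewards, [1.0, 0.5, 1.0], gamma))
```

Because the shaped reward and discount depend only on a single transition $(s, a, r, s')$, a target of this form can be computed from replayed transitions, which matches the replay compatibility described above.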
Abstract
Constrained Markov decision processes (CMDPs) provide a principled model for handling constraints, such as safety and other auxiliary objectives, in reinforcement learning. The common approach of using additive-cost constraints and dual variables often hinders off-policy scalability. We propose a Control as Inference formulation based on stochastic decision horizons, where constraint violations attenuate reward contributions and shorten the effective planning horizon via state-action-dependent continuation. This yields survival-weighted objectives that remain replay-compatible for off-policy actor-critic learning. We propose two violation semantics, absorbing and virtual termination, that share the same survival-weighted return but result in distinct optimization structures that lead to SAC/MPO-style policy improvement. Experiments demonstrate improved sample efficiency and favorable return-violation trade-offs on standard benchmarks. Moreover, MPO with virtual termination (VT-MPO) scales effectively to our high-dimensional musculoskeletal Hyfydy setup.
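To make the horizon-shortening effect concrete, consider a back-of-the-envelope calculation. It assumes, purely for illustration, a constant continuation probability $\alpha$ (in the paper, $\alpha(s,a)$ varies with state and action), a shaped discount of $\gamma\alpha$, and the usual rule of thumb that a discount factor $\gamma$ corresponds to an effective horizon of roughly $1/(1-\gamma)$:

$$
H_{\text{eff}}(\alpha) \approx \frac{1}{1-\gamma\alpha}, \qquad
\gamma = 0.99:\quad H_{\text{eff}}(1) = 100, \qquad
H_{\text{eff}}(0.9) = \frac{1}{1-0.891} \approx 9.2 .
$$

Under these assumptions, a 10% per-step chance of termination cuts the effective planning horizon by about an order of magnitude.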
