Stochastic Decision Horizons for Constrained Reinforcement Learning

Nikola Milosevic; Leonard Franz; Daniel Haeufle; Georg Martius; Nico Scherf; Pavel Kolev

TL;DR

This paper introduces stochastic decision horizons (SDH) as a horizon-shaping framework within Control as Inference to address constrained RL without relying on additive costs or Lagrange multipliers. By attaching a state-action-dependent continuation probability $\alpha(s,a)$, SDH produces shaped rewards $\tilde{r}=\alpha r$ and discounts $\tilde{\gamma}=\gamma\alpha$, yielding a survival-weighted objective that remains replay-compatible for off-policy learning. It formalizes two continuation semantics, absorbing state (AS) and virtual termination (VT), which share the same survival return but differ in KL-regularization interactions, enabling SAC-like AS-SAC and MPO-like VT-MPO algorithms. Empirically, SDH improves sample efficiency and reward–violation trade-offs on Safety Gymnasium and Hyfydy, with VT-MPO scaling effectively to high-dimensional musculoskeletal control without explicit dual optimization. The work lays a foundation for scalable, constraint-aware RL with replay, suggesting future directions in adaptive continuation and risk-aware extensions.
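The shaping rule above can be illustrated with a small sketch. It assumes only the two identities stated in the TL;DR, $\tilde{r}=\alpha r$ and $\tilde{\gamma}=\gamma\alpha$; the function names and the per-step list inputs are hypothetical and not taken from the paper. Applying the shaped reward and shaped discount recursively gives the same value as weighting each reward by $\gamma^t$ times the product of continuation probabilities up to and including step $t$.

```python
def survival_weighted_return(rewards, alphas, gamma):
    """Backward recursion G_t = alpha_t * r_t + (gamma * alpha_t) * G_{t+1}.

    rewards[t] is r(s_t, a_t); alphas[t] is the continuation probability
    alpha(s_t, a_t) in [0, 1]. Names are illustrative assumptions.
    """
    g = 0.0
    for r, a in zip(reversed(rewards), reversed(alphas)):
        shaped_reward = a * r          # r~ = alpha * r
        shaped_discount = gamma * a    # gamma~ = gamma * alpha
        g = shaped_reward + shaped_discount * g
    return g


def survival_weighted_return_unrolled(rewards, alphas, gamma):
    """Same quantity as an explicit sum: sum_t gamma^t * (prod_{k<=t} alpha_k) * r_t."""
    total, survival = 0.0, 1.0
    for t, (r, a) in enumerate(zip(rewards, alphas)):
        survival *= a                  # running product of continuation probabilities
        total += (gamma ** t) * survival * r
    return total


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    alphas = [1.0, 0.5, 1.0]           # a partial violation at step 1
    print(survival_weighted_return(rewards, alphas, gamma=0.9))
    print(survival_weighted_return_unrolled(rewards, alphas, gamma=0.9))
```

With every $\alpha$ equal to 1 the sketch reduces to the ordinary discounted return, and a single $\alpha=0$ removes that step's reward and everything after it, which is the sense in which violations shorten the effective horizon.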

Abstract

Constrained Markov decision processes (CMDPs) provide a principled model for handling constraints, such as safety and other auxiliary objectives, in reinforcement learning. The common approach of using additive-cost constraints and dual variables often hinders off-policy scalability. We propose a Control as Inference formulation based on stochastic decision horizons, where constraint violations attenuate reward contributions and shorten the effective planning horizon via state-action-dependent continuation. This yields survival-weighted objectives that remain replay-compatible for off-policy actor-critic learning. We propose two violation semantics, absorbing and virtual termination, that share the same survival-weighted return but result in distinct optimization structures that lead to SAC/MPO-style policy improvement. Experiments demonstrate improved sample efficiency and favorable return-violation trade-offs on standard benchmarks. Moreover, MPO with virtual termination (VT-MPO) scales effectively to our high-dimensional musculoskeletal Hyfydy setup.
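To make "replay-compatible" concrete, the following is a minimal sketch of a one-step bootstrapped target computed per stored transition, using the shaped reward $\tilde{r}=\alpha r$ and shaped discount $\tilde{\gamma}=\gamma\alpha$ from the TL;DR. It is not the paper's AS-SAC or VT-MPO critic update: the entropy and KL terms of those algorithms are omitted, and the function name, the argument layout, and the idea of evaluating $\alpha$ per transition in a sampled batch are assumptions made for illustration.

```python
def sdh_td_targets(rewards, alphas, next_values, gamma=0.99):
    """One-step targets y = alpha * r + gamma * alpha * V(s') for a replay batch.

    rewards[i], alphas[i], next_values[i] belong to the i-th sampled transition.
    The target depends only on quantities local to that transition, so no
    trajectory-level cost accumulator or dual variable is needed here.
    """
    if not (len(rewards) == len(alphas) == len(next_values)):
        raise ValueError("batch fields must have equal length")
    targets = []
    for r, a, v_next in zip(rewards, alphas, next_values):
        shaped_reward = a * r            # r~ = alpha * r
        shaped_discount = gamma * a      # gamma~ = gamma * alpha
        targets.append(shaped_reward + shaped_discount * v_next)
    return targets


if __name__ == "__main__":
    # Second transition violates the constraint fully (alpha = 0).
    print(sdh_td_targets([1.0, 2.0], [1.0, 0.0], [10.0, 10.0], gamma=0.9))
```

Because each target is a function of a single transition, transitions can be drawn from a replay buffer in any order, which is the property the abstract contrasts with additive-cost formulations that rely on dual variables.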

Paper Structure (123 sections, 11 theorems, 105 equations, 10 figures, 2 tables, 4 algorithms)

Introduction
Stochastic decision horizons.
Contributions.
Related Work
On-policy methods for constrained RL.
Off-policy methods for constrained RL.
Constraint-based problem formulations.
Control as Inference.
Embodied exploration and musculoskeletal control.
Background
Control as Inference.
Soft Bellman operators.
Discounting as stochastic termination.
Control as Inference with Stochastic Decision Horizons
SDH as the general continuation model.
...and 108 more sections

Key Result

Proposition 4.1

Define the survival-return and survival-discount weights (with $u_0=1$). Then the two semantics induce distinct variational objectives:

Figures (10)

Figure 1: Overview of stochastic decision horizons (SDH) and evaluation domains. (a) SDH modulates the planning horizon via continuation probabilities ($\alpha$), inducing variable discounting $\tilde{\gamma}=\gamma\alpha$ and survival-weighted returns $\tilde{r}=\alpha r$. (b) Hyfydy humanoid locomotion tasks, where low-effort gait optimization is constrained by a target walking speed. (c) Safety Gymnasium environments, which combine goal-reaching rewards with standardized hazard costs.
Figure 2: Control as Inference (CaI) casts optimal control as variational inference in a probabilistic graphical model (PGM). The figure shows an infinite-horizon CaI-PGM, where a Bernoulli variable ($C_t$) encodes continuation depending on state-action events.
Figure 3: Aggregated sample efficiency on the Safety Gymnasium benchmark, computed using rliable (Agarwal et al., 2021). Curves show the interquartile mean (IQM) across environments and seeds for episodic reward (left) and cost (right) as a function of environment interactions. The reward is normalized w.r.t. unconstrained PPO at 10 million steps. Shaded regions denote 95% stratified bootstrap confidence intervals. VT-MPO and AS-SAC achieve favorable reward-violation trade-offs compared to the baselines.
Figure 4: Hyfydy gait metrics for VT-MPO ($\bullet$) and EWA ($\bullet$) on H0918, H1622, and H2190 under the minimal effort-velocity constrained formulation (no extra biomechanical penalties). Curves show mean and min/max over 3 seeds; the dashed line marks the target-velocity threshold $v_{\mathrm{fwd}}(s_t)=1.1\,\mathrm{m/s}$. VT-MPO converges to energy-efficient gaits while satisfying the velocity constraint.
Figure 5: Results on the Safety Gymnasium benchmark for individual environments.
...and 5 more figures

Theorems & Definitions (22)

Proposition 4.1: ELBO objectives for AS/VT
Theorem 3.1: Exact objectives for CaI+AS and CaI+VT
Lemma 3.2: Trajectory KL reduces to a gated sum of log policy ratios
proof
Lemma 3.3: Conditional expectations of gates
proof
proof
Definition 3.5: Variable-discount evaluation operator
Lemma 3.6: Contraction
proof
...and 12 more
