Stochastic Decision Horizons for Constrained Reinforcement Learning

Nikola Milosevic, Leonard Franz, Daniel Haeufle, Georg Martius, Nico Scherf, Pavel Kolev

TL;DR

This paper introduces stochastic decision horizons (SDH) as a horizon-shaping framework within Control as Inference to address constrained RL without relying on additive costs or Lagrange multipliers. By attaching a state-action-dependent continuation probability $\alpha(s,a)$, SDH produces shaped rewards $\tilde{r}=\alpha r$ and discounts $\tilde{\gamma}=\gamma\alpha$, yielding a survival-weighted objective that remains replay-compatible for off-policy learning. It formalizes two continuation semantics, absorbing state (AS) and virtual termination (VT), which share the same survival return but differ in KL-regularization interactions, enabling SAC-like AS-SAC and MPO-like VT-MPO algorithms. Empirically, SDH improves sample efficiency and reward–violation trade-offs on Safety Gymnasium and Hyfydy, with VT-MPO scaling effectively to high-dimensional musculoskeletal control without explicit dual optimization. The work lays a foundation for scalable, constraint-aware RL with replay, suggesting future directions in adaptive continuation and risk-aware extensions.

Abstract

Constrained Markov decision processes (CMDPs) provide a principled model for handling constraints, such as safety and other auxiliary objectives, in reinforcement learning. The common approach of using additive-cost constraints and dual variables often hinders off-policy scalability. We propose a Control as Inference formulation based on stochastic decision horizons, where constraint violations attenuate reward contributions and shorten the effective planning horizon via state-action-dependent continuation. This yields survival-weighted objectives that remain replay-compatible for off-policy actor-critic learning. We propose two violation semantics, absorbing and virtual termination, that share the same survival-weighted return but result in distinct optimization structures that lead to SAC/MPO-style policy improvement. Experiments demonstrate improved sample efficiency and favorable return-violation trade-offs on standard benchmarks. Moreover, MPO with virtual termination (VT-MPO) scales effectively to our high-dimensional musculoskeletal Hyfydy setup.
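
As a rough illustration of the survival-weighted objective, the following Python sketch composes the shaped reward $\tilde{r}=\alpha r$ and the shaped discount $\tilde{\gamma}=\gamma\alpha$ along a single recorded trajectory. The function names, the per-step composition of the weights, and the toy numbers are assumptions made for illustration only; they are not taken from the paper's algorithms.

```python
from typing import Sequence


def survival_weighted_return(
    rewards: Sequence[float], alphas: Sequence[float], gamma: float
) -> float:
    """Forward form: sum_t (prod_{k<t} gamma * alpha_k) * alpha_t * r_t.

    Assumes alphas[t] is the continuation probability alpha(s_t, a_t).
    """
    total = 0.0
    weight = 1.0  # accumulated shaped discount, prod_{k<t} gamma * alpha_k
    for r, a in zip(rewards, alphas):
        total += weight * a * r  # shaped reward: alpha * r
        weight *= gamma * a      # shaped discount: gamma * alpha
    return total


def backward_recursion(
    rewards: Sequence[float], alphas: Sequence[float], gamma: float
) -> float:
    """Backward form of the same quantity: G_t = alpha_t * (r_t + gamma * G_{t+1})."""
    g = 0.0
    for r, a in zip(reversed(rewards), reversed(alphas)):
        g = a * r + gamma * a * g
    return g


# Hypothetical trajectory: the last step violates the constraint (alpha = 0).
rewards = [1.0, 1.0, 1.0]
alphas = [1.0, 0.5, 0.0]
forward = survival_weighted_return(rewards, alphas, gamma=0.9)
backward = backward_recursion(rewards, alphas, gamma=0.9)
```

Under these assumptions, setting every continuation probability to 1 recovers the ordinary discounted return, while a step with $\alpha=0$ removes its own reward and everything after it. Because the backward form depends only on per-transition quantities $(r, \alpha)$, it is the kind of target that can be computed from replayed transitions.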

Paper Structure

This paper contains 123 sections, 11 theorems, 105 equations, 10 figures, 2 tables, 4 algorithms.

Key Result

Proposition 4.1

Define the survival-return and survival-discount weights (with $u_0=1$). Then the two semantics induce distinct variational objectives:

Figures (10)

  • Figure 1: Overview of stochastic decision horizons (SDH) and evaluation domains. (a) SDH modulates the planning horizon via continuation probabilities ($\alpha$), inducing variable discounting $\tilde{\gamma}=\gamma\alpha$ and survival-weighted returns $\tilde{r}=\alpha r$. (b) Hyfydy humanoid locomotion tasks, where low-effort gait optimization is constrained by a target walking speed. (c) Safety Gymnasium environments, which combine goal-reaching rewards with standardized hazard costs.
  • Figure 2: Control as Inference (CaI) casts optimal control as variational inference in a probabilistic graphical model (PGM). The figure shows an infinite-horizon CaI-PGM, where a Bernoulli variable ($C_t$) encodes continuation depending on state-action events.
  • Figure 3: Aggregated sample efficiency on the Safety Gymnasium benchmark, computed using rliable (Agarwal et al., 2021). Curves show the interquartile mean (IQM) across environments and seeds for episodic reward (left) and cost (right) as a function of environment interactions. The reward is normalized w.r.t. unconstrained PPO at 10 million steps. Shaded regions denote 95% stratified bootstrap confidence intervals. VT-MPO and AS-SAC achieve favorable reward-violation trade-offs compared to the baselines.
  • Figure 4: Hyfydy gait metrics for VT-MPO ($\bullet$) and EWA ($\bullet$) on H0918, H1622, and H2190 under the minimal effort-velocity constrained formulation (no extra biomechanical penalties). Curves show mean and min/max over 3 seeds; the dashed line marks the target-velocity threshold $v_{\mathrm{fwd}}(s_t)=1.1\,\mathrm{m/s}$. VT-MPO converges to energy-efficient gaits while satisfying the velocity constraint.
  • Figure 5: Results on the Safety Gymnasium benchmark for individual environments.
  • ...and 5 more figures

Theorems & Definitions (22)

  • Proposition 4.1: ELBO objectives for AS/VT
  • Theorem 3.1: Exact objectives for CaI+AS and CaI+VT
  • Lemma 3.2: Trajectory KL reduces to a gated sum of log policy ratios
  • proof
  • Lemma 3.3: Conditional expectations of gates
  • proof
  • proof
  • Definition 3.5: Variable-discount evaluation operator
  • Lemma 3.6: Contraction
  • proof
  • ...and 12 more
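
The list above names a variable-discount evaluation operator (Definition 3.5) and a contraction result (Lemma 3.6) without stating them. A minimal tabular sketch follows, assuming the operator takes the form $(T^{\pi}V)(s)=\sum_a \pi(a\mid s)\,\alpha(s,a)\,[r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)V(s')]$, which is what the shaped reward $\tilde{r}=\alpha r$ and shaped discount $\tilde{\gamma}=\gamma\alpha$ would suggest. The operator form, the function names, and the toy MDP are hypothetical and may differ from the paper's definitions.

```python
from typing import List

Vector = List[float]
Table = List[List[float]]  # indexed [state][action]


def backup(V: Vector, P: List[List[Vector]], r: Table, alpha: Table,
           pi: Table, gamma: float) -> Vector:
    """One application of the assumed variable-discount operator."""
    n_states = len(V)
    new_V = []
    for s in range(n_states):
        v = 0.0
        for a in range(len(pi[s])):
            expected_next = sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
            # alpha gates both the immediate reward and the bootstrap term.
            v += pi[s][a] * alpha[s][a] * (r[s][a] + gamma * expected_next)
        new_V.append(v)
    return new_V


def evaluate(P: List[List[Vector]], r: Table, alpha: Table, pi: Table,
             gamma: float, iters: int = 200) -> Vector:
    """Fixed-point iteration; converges because gamma * alpha < 1 here."""
    V = [0.0] * len(r)
    for _ in range(iters):
        V = backup(V, P, r, alpha, pi, gamma)
    return V


# Hypothetical 2-state, 2-action MDP with a uniform policy.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]
r = [[1.0, 0.0], [0.5, 2.0]]
alpha = [[1.0, 0.6], [0.9, 0.3]]
pi = [[0.5, 0.5], [0.5, 0.5]]
gamma = 0.95
V_star = evaluate(P, r, alpha, pi, gamma)
```

For this assumed operator, two value vectors that differ by at most $d$ in sup norm are mapped to vectors that differ by at most $\gamma\,\max_{s,a}\alpha(s,a)\,d$, so repeated backups converge to a unique fixed point whenever $\gamma\,\max\alpha<1$.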