Policy Bifurcation in Safe Reinforcement Learning

Wenjun Zou; Yao Lyu; Jie Li; Yujie Yang; Shengbo Eben Li; Jingliang Duan; Xianyuan Zhan; Jingjing Liu; Yaqin Zhang; Keqiang Li

TL;DR

The paper reveals a fundamental limitation of continuous policies in safe RL when safety constraints induce non-simply connected feasible sets, showing that the reachable tuple $\mathcal{R}$ can be noncontractible or that feasible continuous policies may not exist under certain initial conditions. It develops a topological framework based on paths, loops, and contractibility to derive sufficient conditions for suboptimality and infeasibility of continuous policies in constrained OCPs. To address this, it proposes Multimodal Policy Optimization (MUPO), a bifurcated policy method that outputs a Gaussian mixture with a gate selecting the highest-probability component, and it augments learning with spectral normalization and forward KL divergence to capture multiple modes. Empirical results in simulation (bypass and encounter tasks) and real-world robotics experiments demonstrate that MUPO achieves safety and near-optimal performance where continuous policies struggle, highlighting a practical shift toward bifurcated policy designs for safety-critical control.

Abstract

Safe reinforcement learning (RL) offers advanced solutions to constrained optimal control problems. Existing studies in safe RL implicitly assume continuity in policy functions, where policies map states to actions in a smooth, uninterrupted manner. However, our research finds that in some scenarios the feasible policy should be discontinuous or multi-valued, and that interpolating between discontinuous local optima inevitably leads to constraint violations. We are the first to identify the generating mechanism of this phenomenon, and we employ topological analysis to rigorously prove the existence of policy bifurcation in safe RL, which corresponds to the contractibility of the reachable tuple. Our theorem reveals that in scenarios where the obstacle-free state space is non-simply connected, a feasible policy is required to be bifurcated, meaning its output action needs to change abruptly in response to the varying state. To train such a bifurcated policy, we propose a safe RL algorithm called multimodal policy optimization (MUPO), which utilizes a Gaussian mixture distribution as the policy output. The bifurcated behavior can be achieved by selecting the Gaussian component with the highest mixing coefficient. In addition, MUPO integrates spectral normalization and forward KL divergence to enhance the policy's capability of exploring different modes. Experiments with vehicle control tasks show that our algorithm successfully learns the bifurcated policy and ensures satisfactory safety, while a continuous policy suffers from inevitable constraint violations.
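The component-selection step described in the abstract can be illustrated with a minimal sketch. It is not the MUPO implementation: `toy_policy_head`, its logits, and the ±0.3 rad component means are hypothetical stand-ins for a trained network, chosen only to show how picking the component with the largest mixing coefficient yields an abrupt change in the action, whereas averaging the components does not.

```python
import numpy as np

def mixing_coefficients(mix_logits):
    """Softmax over component logits, giving weights that sum to one."""
    z = np.asarray(mix_logits, dtype=float)
    w = np.exp(z - z.max())  # subtract the max for numerical stability
    return w / w.sum()

def bifurcated_action(mix_logits, means):
    """Return the mean of the component with the largest mixing coefficient."""
    weights = mixing_coefficients(mix_logits)
    k = int(np.argmax(weights))
    return float(np.asarray(means, dtype=float)[k]), k

def mixture_mean(mix_logits, means):
    """Weighted average of the component means (a continuous alternative)."""
    return float(mixing_coefficients(mix_logits) @ np.asarray(means, dtype=float))

def toy_policy_head(p_y):
    """Hypothetical two-component head: logits and means are made up for illustration."""
    mix_logits = np.array([4.0 * p_y, -4.0 * p_y])  # component 0 dominates when p_y > 0
    means = np.array([0.3, -0.3])                   # two opposite steering modes (rad)
    return mix_logits, means

if __name__ == "__main__":
    for p_y in (-0.05, 0.05):
        action, k = bifurcated_action(*toy_policy_head(p_y))
        print(f"p_y={p_y:+.2f}: component {k}, steering {action:+.2f} rad")
    print("mixture mean at p_y=0:", mixture_mean(*toy_policy_head(0.0)))
```

In this toy setting the selected action flips between the two modes as `p_y` crosses zero, while the mixture mean passes smoothly through zero steering, which is the kind of interpolated action the abstract warns against.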

Paper Structure

This paper contains 4 sections, 4 theorems, 23 equations, 6 figures, 1 table, 1 algorithm.

Simulation experiments
Real-world experiments
Policy evaluation
Policy improvement

Key Result

Lemma 1

Given the control system defined by the dynamic function $f$ and the control policy $\pi$, we define the continuous-time state transition function $F_{\pi}: \mathcal{X} \times \mathbb{R}_{\geq 0} \to \mathcal{X}$, where $F_{\pi}(x_\mathrm{init},t)$ maps an initial state $x_\mathrm{init}$ to the state reached at time $t$. The relationship between $F_{\pi}$ and the solution of the differential equation is given by:
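As a rough numerical illustration of the state transition function, the sketch below approximates $F_{\pi}(x_\mathrm{init},t)$ by forward-Euler integration. It assumes the closed-loop dynamics take the common form $\dot{x} = f(x, \pi(x))$ with $x(0) = x_\mathrm{init}$, which this summary does not state explicitly. The function name `flow`, the step size `dt`, and the single-integrator example with policy $\pi(x) = -x$ are all hypothetical choices made for illustration.

```python
import numpy as np

def flow(f, pi, x_init, t, dt=1e-3):
    """Approximate F_pi(x_init, t) for the assumed dynamics dx/dt = f(x, pi(x))."""
    x = np.array(x_init, dtype=float)
    n_steps = int(round(t / dt))
    for _ in range(n_steps):
        x = x + dt * f(x, pi(x))  # one forward-Euler step of the closed loop
    return x

def f_example(x, u):
    """Hypothetical single-integrator dynamics: the state derivative equals the action."""
    return u

def pi_example(x):
    """Hypothetical continuous policy: linear feedback toward the origin."""
    return -x

if __name__ == "__main__":
    x0 = [1.0, -2.0]
    print("F(x0, 0)   =", flow(f_example, pi_example, x0, 0.0))
    print("F(x0, 1)   =", flow(f_example, pi_example, x0, 1.0))
    print("x0*exp(-1) =", np.array(x0) * np.exp(-1.0))  # closed-form solution for comparison
```

For this example the exact flow is $x_\mathrm{init} e^{-t}$, so the Euler result can be checked against it. Two properties hold in the sketch: $F_{\pi}(x_\mathrm{init}, 0) = x_\mathrm{init}$, and flowing for $t_1$ then $t_2$ matches flowing for $t_1 + t_2$.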

Figures (6)

Figure 1: Limitations of continuous policies in vehicle control problems. (a) Illustration of a control problem for autonomous vehicles that encounter an obstacle and must reach a goal region. (b) The trajectories of a vehicle that start from different lateral positions $p_y$ (in meters) under the optimal policy. There is a bifurcation around $p_y=0$ where the vehicle steers around the obstacle in different directions. (c) Correlation between the front wheel steering angle $\delta$ (in radians) and lateral position $p_y$ for an autonomous vehicle that starts at a fixed longitudinal position $p_x$ and speed. The dots show optimal steering responses, with a jump at $p_y=0$. The curve denotes a continuous policy that fails to initiate necessary avoidance maneuvers when starting from the road's center.
Figure 2: Suboptimality of continuous policies. (a) Illustration of a constrained OCP in a 2D state space, extended with a time dimension. Because the safety constraints are active at all times, the violation region $\mathrm{X}_{\mathrm{viol}}$ is depicted as a cylinder that extends along the time dimension. For the open-loop optimal solution, trajectories from initial states $x_\mathrm{a}$ and $x_\mathrm{b}$ to goal states $x_\mathrm{a}'$ and $x_\mathrm{b}'$ bypass the obstacle on different sides. The loop $\ell_{aa'b'ba}$ cannot be continuously contracted to a point within $\mathrm{X}_{\mathrm{cstr}}$ without passing through $\mathrm{X}_{\mathrm{viol}}$, due to the non-simply connected property of $\mathrm{X}_{\mathrm{cstr}}$. According to our theoretical analysis, for any continuous policy acting in this manner, there must exist an initial state $x_{c}$ from which the trajectory violates the constraints. (b) Illustration of a feasible continuous policy for the constrained OCP, showing trajectories that all avoid the obstacle on the same side to ensure the feasibility of the continuous policy. The continuous policy is forced to take a significantly suboptimal trajectory compared to open-loop optimal solutions, leading to a substantial loss in optimality.
Figure 3: Infeasibility of continuous policies. This figure illustrates a constrained OCP within a 2D state space. The initial state set $\mathrm{X}_{\mathrm{init}}$ is noncontractible; for instance, a closed curve $\ell_{\mathrm{init}}$ within $\mathrm{X}_{\mathrm{init}}$ encircles the violation region $\mathrm{X}_{\mathrm{viol}}$. Under a continuous policy, all points on $\ell_{\mathrm{init}}$ after the same time period form a new closed curve, such as $\ell_{\mathrm{mid}}$ in the figure. However, there exists no continuous deformation in topology that can transform $\ell_{\mathrm{init}}$ into $\ell_{\mathrm{goal}}$ without intersecting $\mathrm{X}_{\mathrm{viol}}$, thus demonstrating the infeasibility of the continuous policy.
Figure 4: Experimental results and visualization. (a) Bypass task illustration. (b) Encounter task illustration. (c) Open-loop control for the Bypass task. (d) Open-loop control for the Encounter task. (e) Training curves for the Bypass task. (f) Training curves for the Encounter task. (g) Front-wheel steering angle $\delta$ changes over time for the Bypass task where the vehicle starts from different lateral positions $p_y$. (h) Trajectories for the Bypass task where the vehicle starts from different lateral positions $p_y$.
Figure 5: Comparison of autonomous vehicle trajectories under continuous versus bifurcated policies in real-world execution. (a) Trajectory visualization of autonomous driving with continuous policy. (b) Trajectory visualization of autonomous driving with bifurcated policy. (c) Snapshot of autonomous vehicle executing continuous policy. (d) Snapshot of autonomous vehicle executing bifurcated policy.
...and 1 more figure
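The loop argument in the captions of Figures 2 and 3 can be made concrete with a winding-number calculation. This is an illustration of the underlying topological idea, not the paper's proof. The sketch assumes a point obstacle at the origin and a goal loop that does not encircle it; `winding_number`, `loop_init`, and `loop_goal` are hypothetical names and shapes. Because the winding number around the obstacle is unchanged by any continuous deformation that avoids the obstacle, two loops with different winding numbers cannot be deformed into each other without crossing it.

```python
import numpy as np

def winding_number(loop_xy, point):
    """Number of counter-clockwise turns a closed polyline makes around `point`."""
    pts = np.asarray(loop_xy, dtype=float) - np.asarray(point, dtype=float)
    angles = np.arctan2(pts[:, 1], pts[:, 0])
    # Angle increments between consecutive vertices, including the closing edge.
    d = np.diff(np.append(angles, angles[0]))
    # Wrap each increment into [-pi, pi) so the arctan2 branch cut is handled.
    d = (d + np.pi) % (2.0 * np.pi) - np.pi
    return int(round(d.sum() / (2.0 * np.pi)))

theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
obstacle = (0.0, 0.0)  # assumed point obstacle standing in for the violation region

# Hypothetical initial loop: a circle of radius 2 that encircles the obstacle.
loop_init = np.stack([2.0 * np.cos(theta), 2.0 * np.sin(theta)], axis=1)
# Hypothetical goal loop: a small circle off to the side, not encircling it.
loop_goal = np.stack([5.0 + 0.5 * np.cos(theta), 0.5 * np.sin(theta)], axis=1)

if __name__ == "__main__":
    print("winding number of initial loop:", winding_number(loop_init, obstacle))
    print("winding number of goal loop:   ", winding_number(loop_goal, obstacle))
```

The two loops have different winding numbers in this toy configuration, which mirrors the caption's statement that the initial loop cannot be continuously transformed into the goal loop without intersecting the violation region.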

Theorems & Definitions (14)

Lemma 1
proof
Definition 1: Augmented state
Definition 2: Truncated augmented state transition function
Definition 3: Reachable tuple
Definition 4: Path and loop
Definition 5: Contractibility
Lemma 2: Contractibility of the reachable tuple
proof
Theorem 1: Suboptimality of continuous policies
...and 4 more
