Provably Efficient RL for Linear MDPs under Instantaneous Safety Constraints in Non-Convex Feature Spaces

Amirhossein Roknilamouki, Arnob Ghosh, Ming Shi, Fatemeh Nourzad, Eylem Ekici, Ness B. Shroff

TL;DR

This work addresses safe reinforcement learning under instantaneous hard safety constraints in non-convex feature spaces for linear MDPs. It builds two complementary strategies: OCD (for star-convex decision spaces) to tightly bound the covering number of value functions, and NCS-LSVI (for non-star-convex spaces) with a pure-safe exploration phase to stabilize the safe set before balanced exploration–exploitation. The authors prove regret bounds of order $\tilde{\mathcal{O}}\big((1+1/\tau)\sqrt{\log(1/\tau)\; d^3 H^4 K}\big)$ with zero safety violations w.h.p., and show that the non-star-convex setting can attain near-parity performance via a two-phase approach, validated by autonomous-driving simulations. The results illuminate how the geometry of the feasible action space governs safe-RL complexity and provide a foundation for extending to more expressive, nonlinear representations. Overall, the paper contributes a rigorous, geometry-aware safe-RL framework with provable guarantees and practical relevance to safety-critical domains.

Abstract

In Reinforcement Learning (RL), tasks with instantaneous hard constraints present significant challenges, particularly when the decision space is non-convex or non-star-convex. This issue is especially relevant in domains like autonomous vehicles and robotics, where constraints such as collision avoidance often take a non-convex form. In this paper, we establish a regret bound of $\tilde{\mathcal{O}}\bigl(\bigl(1 + \tfrac{1}{\tau}\bigr) \sqrt{\log(\tfrac{1}{\tau})\, d^3 H^4 K} \bigr)$, applicable to both star-convex and non-star-convex cases, where $d$ is the feature dimension, $H$ the episode length, $K$ the number of episodes, and $\tau$ the safety threshold. Moreover, the violation of safety constraints is zero with high probability throughout the learning process. A key technical challenge in these settings is bounding the covering number of the value-function class, which is essential for achieving value-aware uniform concentration in model-free function approximation. For the star-convex setting, we develop a novel technique called Objective Constraint-Decomposition (OCD) to properly bound the covering number. This result also resolves an error in a previous work on constrained RL. In non-star-convex scenarios, where the covering number can become infinitely large, we propose a two-phase algorithm, Non-Convex Safe Least Squares Value Iteration (NCS-LSVI), which first reduces uncertainty about the safe set by playing a known safe policy. After that, it carefully balances exploration and exploitation to achieve the regret bound. Finally, numerical simulations on an autonomous driving scenario demonstrate the effectiveness of NCS-LSVI.
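To make the scaling in the regret bound concrete, the following minimal Python sketch evaluates the leading term $(1 + 1/\tau)\sqrt{\log(1/\tau)\, d^3 H^4 K}$. It is an illustration only: the function name `regret_bound_scale` is hypothetical, the constants and polylogarithmic factors hidden by $\tilde{\mathcal{O}}$ are dropped, and the restriction $0 < \tau < 1$ is an assumption made here so that $\log(1/\tau)$ is positive. Only ratios between values are meaningful.

```python
import math


def regret_bound_scale(d: int, H: int, K: int, tau: float) -> float:
    """Leading term (1 + 1/tau) * sqrt(log(1/tau) * d^3 * H^4 * K).

    Hypothetical helper: constants and the log factors hidden by O-tilde
    are omitted, so the absolute value carries no meaning.
    """
    # Assumption of this sketch (not stated in the abstract): 0 < tau < 1.
    if not 0.0 < tau < 1.0:
        raise ValueError("this sketch assumes 0 < tau < 1")
    return (1.0 + 1.0 / tau) * math.sqrt(math.log(1.0 / tau) * d**3 * H**4 * K)


# Quadrupling the number of episodes K doubles the leading term (sqrt(K) growth).
ratio_K = regret_bound_scale(4, 5, 400, 0.5) / regret_bound_scale(4, 5, 100, 0.5)

# Doubling the episode length H quadruples it (H^4 under the square root).
ratio_H = regret_bound_scale(4, 10, 100, 0.5) / regret_bound_scale(4, 5, 100, 0.5)
```

Because the term grows as $\sqrt{K}$, the per-episode average shrinks as $1/\sqrt{K}$, while a smaller safety threshold $\tau$ inflates the term through both $(1 + 1/\tau)$ and $\sqrt{\log(1/\tau)}$.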

Paper Structure

This paper contains 90 sections, 26 theorems, 134 equations, 6 figures, 1 algorithm.

Key Result

Proposition 2.2

(Proposition A.1 from jin2020provably). Let $\mathcal{F}_s \triangleq \{\phi(s,a) \in \mathbb{R}^d \mid a \in \mathcal{A} \}$, and $\mathcal{F} \triangleq \{\phi(s,a) \in \mathbb{R}^d \mid (a,s) \in \mathcal{A}\times \mathcal{S} \}$. Then, there exists a vector $\mu^* \in \mathbb{R}^d$ such that …

Figures (6)

  • Figure 1: An illustrative example of the Local Point Assumption and the Star-Convexity Assumption. The left figure demonstrates the Local Point Assumption, where a sphere exists around the initial safe point. The right figure depicts a Star-Convex Set, where all points are connected to the initial safe point by a line segment.
  • Figure 2: Left figure: The green car (G) must decide whether to stop at the intersection for the approaching orange car (O) or accelerate to pass before O arrives. Right figure: The green car's decision space, where the red region is inaccessible due to the collision avoidance module.
  • Figure 3: Regret vs. episodes for NCS-LSVI in an autonomous vehicle merging scenario.
  • Figure 4: Plot of $f(x)$ showing the dynamics function behavior.
  • Figure 5: Diagram of the autonomous-vehicle example. The agent interacts with the environment and observes feedback on its location and speed. It uses this feedback to improve its lane-keeping estimate. Then, using lane keeping and a trained collision-avoidance (Collav) module, it obtains the safe set of actions. The decision-making module uses the feedback to improve its estimate of the $Q$ function and then uses the safe set to make the next decision. Note that the Collav block is trained a priori and is not learned, whereas lane keeping and decision making are the blocks that the RL agent needs to learn.
  • ...and 1 more figure
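
The star-convexity notion in the Figure 1 caption (every point of the set is joined to the initial safe point by a line segment lying in the set) can be illustrated with a small sampling check. This is a minimal sketch, not the paper's procedure: the helper `is_star_convex_on_samples`, the membership predicates `disk` and `ring`, and the sample grid are all hypothetical, and passing the check on finitely many samples is only a necessary condition, not a proof of star-convexity.

```python
from typing import Callable, Iterable, Sequence

Point = Sequence[float]


def is_star_convex_on_samples(
    in_set: Callable[[Point], bool],
    center: Point,
    points: Iterable[Point],
    n_steps: int = 50,
) -> bool:
    """Check that segments from `center` to each sampled set member stay in the set.

    Hypothetical helper: it tests only the given samples and a discretized
    segment, so True means "no violation found", not a proof.
    """
    for p in points:
        if not in_set(p):
            continue  # only members of the set need a segment to the center
        for i in range(n_steps + 1):
            t = i / n_steps
            q = tuple(c + t * (x - c) for c, x in zip(center, p))
            if not in_set(q):
                return False
    return True


# Unit disk: star-convex with respect to its center.
disk = lambda p: p[0] ** 2 + p[1] ** 2 <= 1.0
# Ring with inner radius 0.5: the hole blocks segments crossing the middle.
ring = lambda p: 0.25 <= p[0] ** 2 + p[1] ** 2 <= 1.0

grid = [(i / 4, j / 4) for i in range(-4, 5) for j in range(-4, 5)]

disk_ok = is_star_convex_on_samples(disk, (0.0, 0.0), grid)   # no violation found
ring_ok = is_star_convex_on_samples(ring, (0.75, 0.0), grid)  # violation found
```

The ring fails because the segment from the point (0.75, 0) to the sample (-0.75, 0) passes through the excluded middle, which loosely mirrors how an inaccessible region can break star-convexity of a decision space.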

Theorems & Definitions (33)

  • Proposition 2.2
  • Remark 2.4
  • Theorem 5.1
  • Theorem 5.2: Corrected covering-number result
  • Lemma 5.3
  • Theorem 5.4
  • Remark 6.1
  • Lemma 6.2
  • Lemma 6.3
  • Definition 3.1
  • ...and 23 more