Shielded Reinforcement Learning Under Dynamic Temporal Logic Constraints

Sadık Bera Yüksel; Ali Tevfik Buyukkocak; Derya Aksaray

Abstract

Reinforcement Learning (RL) has shown promise in various robotics applications, yet its deployment on real systems is still limited due to safety and operational constraints. Safe RL, which focuses on imposing safety constraints throughout the learning process, has gained considerable attention in recent years. However, real systems often require more complex constraints than safety alone, such as periodic recharging or time-bounded visits to specific regions. Imposing such spatio-temporal tasks during learning remains a challenge. Signal Temporal Logic (STL) is a formal language for specifying temporal properties of real-valued signals and provides a way to express such complex tasks. In this paper, we propose a framework that leverages sequential control barrier functions and model-free RL to ensure that the given STL tasks are satisfied throughout the learning process. Our method extends beyond traditional safety constraints by enforcing rich STL specifications, which can involve visits to dynamic targets with unknown trajectories. We also demonstrate the effectiveness of our framework through various simulations.
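To make the kind of time-bounded task mentioned in the abstract concrete, the following is a minimal sketch of the standard quantitative (robustness) semantics of STL for two operators over a sampled 2D trajectory. It is not taken from the paper: the function names, the disc-shaped region, and the use of integer step indices as time bounds are all assumptions made for illustration. A positive value means the formula is satisfied, a negative value means it is violated.

```python
import math


def rho_in_region(traj, k, center, radius):
    """Robustness of the predicate ||x_k - center|| <= radius at step k."""
    dx, dy = traj[k][0] - center[0], traj[k][1] - center[1]
    return radius - math.hypot(dx, dy)


def rho_eventually(traj, a, b, center, radius):
    """Robustness of F_[a,b](in region): best value over the window."""
    return max(rho_in_region(traj, k, center, radius) for k in range(a, b + 1))


def rho_always(traj, a, b, center, radius):
    """Robustness of G_[a,b](in region): worst value over the window."""
    return min(rho_in_region(traj, k, center, radius) for k in range(a, b + 1))


# Hypothetical trajectory moving along the x-axis toward a region centred at (3, 0).
trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
goal, goal_radius = (3.0, 0.0), 0.5

visit_by_step_3 = rho_eventually(trajectory, 0, 3, goal, goal_radius)
visit_by_step_2 = rho_eventually(trajectory, 0, 2, goal, goal_radius)
stay_whole_time = rho_always(trajectory, 0, 3, goal, goal_radius)
```

Here the visit succeeds when the deadline is step 3 but fails when it is step 2, which is the distinction a time-bounded STL task captures and a plain safety constraint does not.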

Paper Structure (11 sections, 1 theorem, 24 equations, 3 figures)

INTRODUCTION
PRELIMINARIES
Signal Temporal Logic
Time-Varying (Zeroing) Control Barrier Functions
PROBLEM STATEMENT
PROPOSED SOLUTION
Reinforcement Learning Under STL Constraints
Sequential CBFs for Dynamic STL Specifications
Proposed STL-Constrained RL Framework
CASE STUDIES
CONCLUSION

Key Result

Theorem 1

Consider an agent with dynamics as in (agent_dynamics), and let $\bm{b}(x, t)$ be a sequential CBF as defined in (critical_CBF). Then, if $\alpha(\bm{b}) = \gamma \bm{b}$ for some $\gamma > 0$ and $\bm{b}(x_0, 0) \geq 0$, the value of the barrier function along the system trajectory is bounded from below. In other words, the worst-case violation is bounded below by $-\epsilon/\gamma$, and the controller …
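The theorem statement is truncated in this summary, so the following is only a sketch of how a bound of the form $-\epsilon/\gamma$ typically arises. It assumes, hypothetically, that the closed-loop system satisfies a relaxed barrier condition $\dot{\bm{b}} \geq -\gamma \bm{b} - \epsilon$ with a constant slack $\epsilon \geq 0$; neither this inequality nor the meaning of $\epsilon$ is confirmed by the excerpt. Under that assumption, the comparison lemma gives:

```latex
\begin{align}
\dot{\bm{b}}(x(t), t) &\geq -\gamma\, \bm{b}(x(t), t) - \epsilon
  && \text{(assumed relaxed CBF condition)} \\
\bm{b}(x(t), t) &\geq e^{-\gamma t}\, \bm{b}(x_0, 0)
  - \frac{\epsilon}{\gamma}\left(1 - e^{-\gamma t}\right)
  && \text{(comparison lemma)} \\
&\geq -\frac{\epsilon}{\gamma}\left(1 - e^{-\gamma t}\right)
  && \text{(since } \bm{b}(x_0, 0) \geq 0\text{)} \\
&\geq -\frac{\epsilon}{\gamma}
  && \text{for all } t \geq 0 .
\end{align}
```

With $\epsilon = 0$ this reduces to the usual forward-invariance result $\bm{b}(x(t), t) \geq 0$.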

Figures (3)

Figure 1: Overview of the proposed framework. At each time step, the RL agent outputs an unconstrained control input $u_t^{RL}$, which is then adjusted by a corrective input $u_t^{CBF}$ computed using sequential CBFs to satisfy the given STL task. The final control $\tilde{u}_t$ is applied to the system, and the resulting next state and reward $(x_{t+1}, r_t)$ are fed back to the RL agent.
Figure 2: Simulation environment setups: (a) Case 1 and (b) Case 2.
Figure 3: Learning curve comparisons of SAC models trained without STL tasks (orange) and with the STL specifications $\varPhi_1$ (blue) and $\varPhi_2$ (green). Results are averaged over 10 independent training runs, and the shaded regions represent the confidence intervals of two standard deviations.
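The loop described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a single time-varying CBF with a linearly shrinking disc instead of the paper's sequential CBFs, a static target, single-integrator dynamics, a random stand-in policy instead of SAC, and a closed-form minimum-norm correction for one affine constraint. The reward feedback is omitted, and every name and constant is hypothetical.

```python
import math
import random

# All names and numbers below are illustrative assumptions, not values from the paper.
TARGET = (0.0, 0.0)   # static target centre
R0, RF = 6.0, 0.5     # initial and final radius of the shrinking admissible disc
T_DEADLINE = 10.0     # time by which the agent must be within RF of the target
GAMMA = 5.0           # linear class-K gain, alpha(b) = GAMMA * b
DT = 0.01             # Euler integration step


def radius(t):
    """Radius of the admissible disc, shrinking linearly from R0 to RF."""
    return R0 + (RF - R0) * min(t / T_DEADLINE, 1.0)


def barrier(x, t):
    """Time-varying barrier b(x, t) = r(t)^2 - ||x - target||^2."""
    dx, dy = x[0] - TARGET[0], x[1] - TARGET[1]
    return radius(t) ** 2 - (dx * dx + dy * dy)


def cbf_correction(x, t, u_rl):
    """Minimum-norm u_cbf with grad_b . (u_rl + u_cbf) + db/dt + GAMMA * b >= 0."""
    grad = (-2.0 * (x[0] - TARGET[0]), -2.0 * (x[1] - TARGET[1]))
    r_dot = (RF - R0) / T_DEADLINE if t < T_DEADLINE else 0.0
    db_dt = 2.0 * radius(t) * r_dot
    residual = grad[0] * u_rl[0] + grad[1] * u_rl[1] + db_dt + GAMMA * barrier(x, t)
    grad_sq = grad[0] ** 2 + grad[1] ** 2
    if residual >= 0.0 or grad_sq < 1e-12:
        return (0.0, 0.0)  # the RL action already satisfies the CBF condition
    scale = -residual / grad_sq  # project onto the constraint boundary
    return (scale * grad[0], scale * grad[1])


def run_episode(seed=0):
    """Roll out the shielded loop; return final distance to target and min barrier value."""
    rng = random.Random(seed)
    x, t = (4.0, 3.0), 0.0
    min_b = barrier(x, t)
    for _ in range(int(round(T_DEADLINE / DT))):
        u_rl = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))  # stand-in policy
        u_cbf = cbf_correction(x, t, u_rl)
        u = (u_rl[0] + u_cbf[0], u_rl[1] + u_cbf[1])  # final applied control
        x = (x[0] + DT * u[0], x[1] + DT * u[1])      # single-integrator step
        t += DT
        min_b = min(min_b, barrier(x, t))
    final_dist = math.hypot(x[0] - TARGET[0], x[1] - TARGET[1])
    return final_dist, min_b
```

Calling `run_episode()` returns the agent's distance to the target at the deadline and the smallest barrier value seen along the way. Because the dynamics are integrated with a fixed Euler step, the barrier can dip slightly below zero; the gain `GAMMA` is chosen large enough that the shrinking disc never demands an unbounded correction near the target.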

Theorems & Definitions (6)

Definition 1: Signal Temporal Logic
Definition 2: Worst-Case Distance to Target
Theorem 1
proof
Definition 3: Feasible State Set
Remark 1
