Table of Contents
Fetching ...

Accelerated Learning with Linear Temporal Logic using Differentiable Simulation

Alper Kamil Bozkurt, Calin Belta, Ming C. Lin

Abstract

Ensuring that reinforcement learning (RL) controllers satisfy safety and reliability constraints in real-world settings remains challenging: state-avoidance and constrained Markov decision processes often fail to capture trajectory-level requirements or induce overly conservative behavior. Formal specification languages such as linear temporal logic (LTL) offer correct-by-construction objectives, yet their rewards are typically sparse, and heuristic shaping can undermine correctness. We introduce, to our knowledge, the first end-to-end framework that integrates LTL with differentiable simulators, enabling efficient gradient-based learning directly from formal specifications. Our method relaxes discrete automaton transitions via soft labeling of states, yielding differentiable rewards and state representations that mitigate the sparsity issue intrinsic to LTL while preserving objective soundness. We provide theoretical guarantees connecting Büchi acceptance to both discrete and differentiable LTL returns and derive a tunable bound on their discrepancy in deterministic and stochastic settings. Empirically, across complex, nonlinear, contact-rich continuous-control tasks, our approach substantially accelerates training and achieves up to twice the returns of discrete baselines. We further demonstrate compatibility with reward machines, thereby covering co-safe LTL and LTL$_\text{f}$ without modification. By rendering automaton-based rewards differentiable, our work bridges formal methods and deep RL, enabling safe, specification-driven learning in continuous domains.

Accelerated Learning with Linear Temporal Logic using Differentiable Simulation

Abstract

Ensuring that reinforcement learning (RL) controllers satisfy safety and reliability constraints in real-world settings remains challenging: state-avoidance and constrained Markov decision processes often fail to capture trajectory-level requirements or induce overly conservative behavior. Formal specification languages such as linear temporal logic (LTL) offer correct-by-construction objectives, yet their rewards are typically sparse, and heuristic shaping can undermine correctness. We introduce, to our knowledge, the first end-to-end framework that integrates LTL with differentiable simulators, enabling efficient gradient-based learning directly from formal specifications. Our method relaxes discrete automaton transitions via soft labeling of states, yielding differentiable rewards and state representations that mitigate the sparsity issue intrinsic to LTL while preserving objective soundness. We provide theoretical guarantees connecting Büchi acceptance to both discrete and differentiable LTL returns and derive a tunable bound on their discrepancy in deterministic and stochastic settings. Empirically, across complex, nonlinear, contact-rich continuous-control tasks, our approach substantially accelerates training and achieves up to twice the returns of discrete baselines. We further demonstrate compatibility with reward machines, thereby covering co-safe LTL and LTL without modification. By rendering automaton-based rewards differentiable, our work bridges formal methods and deep RL, enabling safe, specification-driven learning in continuous domains.

Paper Structure

This paper contains 64 sections, 3 theorems, 18 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

A memoryless product policy $\boldsymbol{\pi}_\varphi^*$ that maximizes the probability of satisfying the Büchi condition in a product MDP $\mathbf{M}$ constructed from a given labeled MDP $M^+$ and the LDBA $\mathcal{A}_\varphi$ derived from a given LTL specification $\varphi$, induces a policy $\p

Figures (9)

  • Figure 1: LTL Returns and Derivatives.Left: The parking scenario where the car must brake to stop in the parking area without entering the grass field ($\varphi_p$). Middle: LTL satisfaction probability and return estimates from discrete and differentiable LTL formulations as functions of deceleration. Right: LTL return gradients with respect to deceleration and their standard deviation. The key challenge in learning from LTL arises from slightly-sloped regions and sharp changes in the returns produced by discrete LTL rewards. Our differentiable LTL approach not only smooths these abrupt changes but also enables the use of low-variance first-order gradient estimates essential for effective learning in slightly-sloped regions.
  • Figure 2: Comparison Across Environments: Differentiable vs. Discrete LTL Rewards. The wider plots show the learning curves of all baseline algorithms, while the narrower plots on the right display the maximum returns achieved after 100 M steps. All results are averaged over 5 random seeds, and the curves are smoothed using max and uniform filters for visual clarity. The reported returns, bounded between 0 and 1, serve as proxies for the probability of satisfying the LTL specifications. In all the environments, algorithms utilizing differentiable LTL rewards (SHAC, AHAC) rapidly learn near-optimal policies, whereas those relying on discrete LTL rewards (PPO, SAC), display high variance, converge slowly, or are stuck with sub-optimal policies.
  • Figure 3: Convergence speed comparison of stochastic gradient descent algorithms using $\bar{\nabla}_\theta^{[0]}$ and $\bar{\nabla}_\theta^{[1]}$ for the parking example ($N=10$).
  • Figure 4: Task Specification with LTL. This figure illustrates a Cheetah policy learned by SHAC using differentiable rewards derived via our approach from the LTL formula $\varphi_\text{legged}$, which specifies accelerating forward, stopping, and maintaining a safe tip-to-ground distance. Specifying the desired behaviors of robots using the high-level language LTL provides is an intuitive alternative to manually designing reward functions, which often require extensive domain expertise and risk unintended behaviors. Enabling learning directly from LTL unlocks new possibilities for robust, safe, and flexible robotic applications. See the supplementary material for the video.
  • Figure 5: The $\omega$-automaton derived from $\varphi_\text{cartpole}$.
  • ...and 4 more figures

Theorems & Definitions (10)

  • Definition 1
  • Definition 2
  • Lemma 1: from Theorem 3 in sickert2016
  • Theorem 1
  • Theorem 2
  • proof
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6