Automaton Constrained Q-Learning

Anastasios Manganaris, Vittorio Giammarino, Ahmed H. Qureshi

TL;DR

ACQL tackles learning for temporally extended robotic tasks with evolving safety constraints by marrying automaton-guided progress with goal-conditioned Q-learning. It introduces an augmented product CMDP where automaton states, subgoals, and safety constraints are integrated, and enforces safety via a minimum-safety objective rather than long-horizon costs. The method densifies sparse LTL-derived rewards through subgoal relabeling (HER) and guarantees convergence of the learned value functions under mild conditions. Empirical results on multiple continuous-control tasks and a real UR5e deployment show ACQL outperforms baselines like RM and LOF, with ablations confirming the essential role of the minimum-safety formulation and subgoal relabeling for robust, scalable LTL-compliant robotic control.
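To make the contrast between a minimum-safety objective and a long-horizon cost concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's formulation: the signed safety margins, the indicator cost, and the cost budget are all hypothetical.

```python
from typing import Sequence


def discounted_cost(costs: Sequence[float], gamma: float) -> float:
    """Long-horizon quantity: discounted sum of per-step costs."""
    return sum((gamma ** t) * c for t, c in enumerate(costs))


def minimum_safety(margins: Sequence[float]) -> float:
    """Minimum-safety quantity: the worst safety margin seen on the trajectory."""
    return min(margins)


# Hypothetical signed safety margins: positive means safe, negative means violating.
margins = [1.0, 0.5, -0.2, 0.8]
# Hypothetical indicator cost derived from the margins (1 on violation, else 0).
costs = [1.0 if m < 0 else 0.0 for m in margins]

gamma = 0.9
budget = 1.0  # hypothetical cumulative-cost budget

total_cost = discounted_cost(costs, gamma)
within_budget = total_cost <= budget   # a budgeted constraint can tolerate the violation
worst_margin = minimum_safety(margins)
always_safe = worst_margin >= 0.0      # the minimum flags any single violation
```

In this toy trajectory the single violation at the third step stays inside the cumulative budget, while the minimum-based check rejects the trajectory outright.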

Abstract

Real-world robotic tasks often require agents to achieve sequences of goals while respecting time-varying safety constraints. However, standard Reinforcement Learning (RL) paradigms are fundamentally limited in these settings. A natural approach to these problems is to combine RL with Linear-time Temporal Logic (LTL), a formal language for specifying complex, temporally extended tasks and safety constraints. Yet, existing RL methods for LTL objectives exhibit poor empirical performance in complex and continuous environments. As a result, no scalable methods support both temporally ordered goals and safety simultaneously, making them ill-suited for realistic robotics scenarios. We propose Automaton Constrained Q-Learning (ACQL), an algorithm that addresses this gap by combining goal-conditioned value learning with automaton-guided reinforcement. ACQL supports most LTL task specifications and leverages their automaton representation to explicitly encode stage-wise goal progression and both stationary and non-stationary safety constraints. We show that ACQL outperforms existing methods across a range of continuous control tasks, including cases where prior methods fail to satisfy either goal-reaching or safety constraints. We further validate its real-world applicability by deploying ACQL on a 6-DOF robotic arm performing a goal-reaching task in a cluttered, cabinet-like space with safety constraints. Our results demonstrate that ACQL is a robust and scalable solution for learning robotic behaviors according to rich temporal specifications.
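Goal-conditioned value learning of the kind described above is commonly paired with hindsight relabeling (HER), which the TL;DR names as the way ACQL densifies sparse rewards. The sketch below shows the generic "future" relabeling strategy on a toy trajectory. It is not ACQL's implementation: the `Transition` layout, the `sparse_reward` convention (1 on success, 0 otherwise), and the tolerance are assumptions, and the automaton state and safety signal are omitted for brevity.

```python
import random
from typing import List, NamedTuple, Tuple


class Transition(NamedTuple):
    state: Tuple[float, ...]
    action: int
    next_state: Tuple[float, ...]
    goal: Tuple[float, ...]
    reward: float


def sparse_reward(achieved: Tuple[float, ...], goal: Tuple[float, ...], tol: float = 0.05) -> float:
    """Hypothetical sparse reward: 1 when the achieved state is within tol of the goal."""
    dist = sum((a - g) ** 2 for a, g in zip(achieved, goal)) ** 0.5
    return 1.0 if dist <= tol else 0.0


def relabel_future(trajectory: List[Transition], rng: random.Random) -> List[Transition]:
    """Replace each goal with a state reached at the same or a later step of the trajectory."""
    relabeled = []
    for t, tr in enumerate(trajectory):
        k = rng.randrange(t, len(trajectory))      # index of a future transition
        new_goal = trajectory[k].next_state        # treat what was achieved as the goal
        new_reward = sparse_reward(tr.next_state, new_goal)
        relabeled.append(tr._replace(goal=new_goal, reward=new_reward))
    return relabeled


# Toy trajectory that never reaches the commanded goal, so every reward is zero.
far_goal = (10.0, 0.0)
states = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
trajectory = [
    Transition(s, 0, s_next, far_goal, sparse_reward(s_next, far_goal))
    for s, s_next in zip(states[:-1], states[1:])
]
relabeled = relabel_future(trajectory, random.Random(0))
```

After relabeling, at least the final transition carries a non-zero reward, because its substituted goal is the state it actually reached.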

Paper Structure

This paper contains 36 sections, 7 theorems, 13 equations, 8 figures, 2 tables, 4 algorithms.

Key Result

Proposition 1

Let $\mathcal{M}^A$ be an augmented CMDP with $|\mathcal{S}^A| < \infty$, $|\mathcal{A}| < \infty$, and $\gamma \in [0, 1)$, and let $Q^c_n$ and $Q^r_n$ be models for the state-action safety and value functions indexed by $n$. Assume they are updated using Robbins-Monro step sizes $a(n)$ and $b(n)$,

Figures (8)

  • Figure 1: The ACQL algorithm relies on a novel augmented formulation of CMDP (left). An input task specification $\phi$ is converted into the DBA $A$, from which safety constraints and subgoals are collected into mappings $S$ and $G$ respectively. The learning agent receives subgoals $g_1, \cdots, g_n$ at every stage in the task from $G$ and safety constraint feedback in $c^A$ from $S$. Trajectories induced by the policy $\pi_j$ are collected into a replay buffer $R$, from which batches $\mathcal{B}^{\tau}_j$ are sampled and modified using HER (Andrychowicz et al., 2017). From these modified trajectories, mini-batches $\mathcal{B}_i$ of transitions $(s^A_t, a_t, r^A_t, c^A_t, s^A_{t+1})$ are used to compute the targets $y_t^r$ in \ref{eqn:normal-bellman-loss-and-target} and $y_t^c$ in \ref{eqn:safety-bellman-target} for training models of the state-action value function and safety function, $Q^r_\theta$ and $Q^c_\psi$, from which an updated policy $\pi_{j+1}$ is derived.
  • Figure 2: ACQL policies trained with safety constraints based on a cabinet's geometry can be successfully deployed on a UR5e manipulator operating in the real cabinet environment.
  • Figure 3: Contour plots for both $Q^c_\psi$ and $Q^r_\theta$ trained with ACQL, using our CMDP formulation in \ref{eqn:min-cost-cmdp} and the standard formulation in \ref{eqn:normal-cmdp-objective}.
  • Figure 4: Automaton for the task "Reach goal $g_1$ or $g_2$ while never entering an unsafe-region $u_1$. Then reach $g_3$.", where achieving the goal $g_i$ corresponds to the atomic proposition $p_i$ and entering $u_1$ corresponds to $p_4$. The full LTL expression is $\neg p_4 \; \mathcal{U} \; ((p_1 \vee p_2) \wedge \circ \lozenge p_3)$. The proposition $p_4$ is only relevant to the task's safety constraint, and the propositions $p_1$, $p_2$, and $p_3$ are only relevant to the task's liveness constraints.
  • Figure 5: Mean and one standard deviation of episode reward throughout training for the five runs per method that are summarized in Table \ref{tab:task-experiments} in our main paper.
  • ...and 3 more figures
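
The task in Figure 4, $\neg p_4 \; \mathcal{U} \; ((p_1 \vee p_2) \wedge \circ \lozenge p_3)$, can be sketched as a small deterministic automaton over sets of true propositions. This is an illustrative reading of the formula, not the automaton drawn in the paper: the state names `q0`, `q1`, `q_acc`, and `q_trap` are hypothetical, and the sketch assumes the goal check takes precedence when a goal and the unsafe region hold at the same step.

```python
from typing import FrozenSet, Iterable, Set

Q0, Q1, Q_ACC, Q_TRAP = "q0", "q1", "q_acc", "q_trap"  # hypothetical state names


def step(q: str, label: FrozenSet[str]) -> str:
    """One automaton transition given the propositions true at this time step."""
    if q == Q0:
        if "p1" in label or "p2" in label:
            return Q1          # first stage done: g1 or g2 reached
        if "p4" in label:
            return Q_TRAP      # entered u1 before finishing the first stage
        return Q0
    if q == Q1:
        # The safety constraint on p4 no longer applies in this stage.
        return Q_ACC if "p3" in label else Q1
    return q                   # q_acc and q_trap are absorbing


def run(labels: Iterable[Set[str]]) -> str:
    """Feed a finite label sequence through the automaton from the initial state."""
    q = Q0
    for label in labels:
        q = step(q, frozenset(label))
    return q


ok = run([set(), {"p2"}, set(), {"p3"}])        # reach g2, then g3
violated = run([{"p4"}, {"p1"}, {"p3"}])        # unsafe region entered first
late_p4 = run([{"p1"}, {"p4"}, {"p3"}])         # p4 after the first stage is allowed
```

The third run shows why the safety constraint is non-stationary: under this formula, $p_4$ is only forbidden until the first subgoal is achieved, so the active constraint depends on the automaton state.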

Theorems & Definitions (12)

  • Proposition 1
  • Lemma 1
  • proof
  • Lemma 2
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • ...and 2 more