Automaton Constrained Q-Learning
Anastasios Manganaris, Vittorio Giammarino, Ahmed H. Qureshi
TL;DR
ACQL learns temporally extended robotic tasks with evolving safety constraints by combining automaton-guided progress tracking with goal-conditioned Q-learning. It introduces an augmented product CMDP (constrained Markov decision process) that integrates automaton states, subgoals, and safety constraints, and it enforces safety through a minimum-safety objective rather than accumulated long-horizon costs. The method densifies sparse LTL-derived rewards through subgoal relabeling with Hindsight Experience Replay (HER), and the learned value functions are guaranteed to converge under mild conditions. Empirical results on multiple continuous-control tasks and a real UR5e deployment show that ACQL outperforms baselines such as Reward Machines (RM) and the Logical Options Framework (LOF). Ablations confirm that both the minimum-safety formulation and subgoal relabeling are essential for robust, scalable LTL-compliant robotic control.
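To make the contrast between a minimum-safety objective and a long-horizon cost concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical signed safety margin per time step (positive means safe, negative means a violation); the function names and the example values are illustrative only.

```python
def discounted_cost(costs, gamma=0.99):
    """Long-horizon cumulative cost: sum over t of gamma^t * c_t."""
    return sum((gamma ** t) * c for t, c in enumerate(costs))


def minimum_safety(margins):
    """Worst-case safety margin along a trajectory: min over t of h_t."""
    return min(margins)


def minimum_safety_to_go(margins):
    """Backward recursion m_t = min(h_t, m_{t+1}), one value per time step."""
    out = []
    running = float("inf")
    for h in reversed(margins):
        running = min(h, running)
        out.append(running)
    return out[::-1]


# Hypothetical signed safety margins for a four-step trajectory (assumption).
margins = [0.5, 0.4, -0.1, 0.6]
# Indicator cost: 1 when the margin is negative, 0 otherwise.
costs = [1.0 if h < 0 else 0.0 for h in margins]

# The trajectory is safe only if its worst margin is non-negative.
is_safe = minimum_safety(margins) >= 0
```

In this toy trajectory the single violation at the third step decides the minimum-safety value outright, whereas its contribution to the discounted cost shrinks with the discount factor and the time at which it occurs.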
Abstract
Real-world robotic tasks often require agents to achieve sequences of goals while respecting time-varying safety constraints. However, standard Reinforcement Learning (RL) paradigms are fundamentally limited in these settings. A natural approach to these problems is to combine RL with Linear-time Temporal Logic (LTL), a formal language for specifying complex, temporally extended tasks and safety constraints. Yet existing RL methods for LTL objectives exhibit poor empirical performance in complex and continuous environments. As a result, no scalable method supports both temporally ordered goals and safety constraints simultaneously, which leaves existing approaches ill-suited to realistic robotics scenarios. We propose Automaton Constrained Q-Learning (ACQL), an algorithm that addresses this gap by combining goal-conditioned value learning with automaton-guided reinforcement. ACQL supports most LTL task specifications and leverages their automaton representation to explicitly encode stage-wise goal progression and both stationary and non-stationary safety constraints. We show that ACQL outperforms existing methods across a range of continuous control tasks, including cases where prior methods fail to satisfy either goal-reaching or safety constraints. We further validate its real-world applicability by deploying ACQL on a 6-DOF robotic arm performing a goal-reaching task in a cluttered, cabinet-like space with safety constraints. Our results demonstrate that ACQL is a robust and scalable solution for learning robotic behaviors according to rich temporal specifications.
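As an illustration of how an automaton can encode stage-wise goal progression together with stationary and non-stationary safety constraints, here is a small sketch under stated assumptions. The task ("reach a, then reach b; always avoid c; avoid d only until a is reached"), the `Stage` structure, and the `step` and `run` helpers are hypothetical and are not taken from ACQL's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    goal: str          # proposition that advances the automaton
    unsafe: frozenset  # propositions that must not hold during this stage


# Hypothetical task: "c" is a stationary constraint (unsafe in every stage),
# while "d" is non-stationary (unsafe only before "a" has been reached).
STAGES = [
    Stage(goal="a", unsafe=frozenset({"c", "d"})),
    Stage(goal="b", unsafe=frozenset({"c"})),
]


def step(q, labels):
    """Advance automaton state q on a set of true propositions.

    Returns (next_q, violated).
    """
    if q >= len(STAGES):          # already accepting: nothing left to do
        return q, False
    stage = STAGES[q]
    if labels & stage.unsafe:     # a constraint active in this stage is hit
        return q, True
    if stage.goal in labels:      # current subgoal reached: move on
        return q + 1, False
    return q, False


def run(trace):
    """Run the automaton over a trace; returns (final_q, task_satisfied)."""
    q = 0
    for labels in trace:
        q, violated = step(q, set(labels))
        if violated:
            return q, False
    return q, q == len(STAGES)
```

For example, `run([set(), {"a"}, {"d"}, {"b"}])` completes both stages because `d` is no longer forbidden once `a` has been reached, whereas `run([{"d"}])` fails immediately in the first stage.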
