Safety Guardrails for LLM-Enabled Robots
Zachary Ravichandran, Alexander Robey, Vijay Kumar, George J. Pappas, Hamed Hassani
TL;DR
This paper addresses safety risks in LLM-enabled robots, particularly adversarial jailbreaking attacks that can cause physical harm. It proposes RoboGuard, a two-stage guardrail. In the first stage, a root-of-trust LLM uses chain-of-thought reasoning to ground high-level safety rules in the robot's environment as linear temporal logic (LTL) specifications. In the second stage, formal control synthesis ensures that any LLM-generated plan satisfies those safety constraints via a Büchi-automaton-based check. In both simulation and real-world experiments, RoboGuard reduces the execution of unsafe plans from 92% to below 2.5% while maintaining performance on safe tasks and remaining resilient to adaptive attacks. The work contributes a general, context-aware safeguard that is resource-efficient and adaptable to different robot platforms and planning architectures, with practical implications for safer deployment of LLM-enabled robots in open-world settings.
Abstract
Although the integration of large language models (LLMs) into robotics has unlocked transformative capabilities, it has also introduced significant safety concerns, ranging from average-case LLM errors (e.g., hallucinations) to adversarial jailbreaking attacks, which can produce harmful robot behavior in real-world settings. Traditional robot safety approaches do not address the novel vulnerabilities of LLMs, and current LLM safety guardrails overlook the physical risks posed by robots operating in dynamic real-world environments. In this paper, we propose RoboGuard, a two-stage guardrail architecture to ensure the safety of LLM-enabled robots. RoboGuard first contextualizes pre-defined safety rules by grounding them in the robot's environment using a root-of-trust LLM, which employs chain-of-thought (CoT) reasoning to generate rigorous safety specifications, such as temporal logic constraints. RoboGuard then resolves potential conflicts between these contextual safety specifications and a possibly unsafe plan using temporal logic control synthesis, which ensures safety compliance while minimally violating user preferences. Through extensive simulation and real-world experiments that consider worst-case jailbreaking attacks, we demonstrate that RoboGuard reduces the execution of unsafe plans from 92% to below 2.5% without compromising performance on safe plans. We also demonstrate that RoboGuard is resource-efficient, robust against adaptive attacks, and significantly enhanced by enabling its root-of-trust LLM to perform CoT reasoning. These results underscore the potential of RoboGuard to mitigate the safety risks and enhance the reliability of LLM-enabled robots.
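
The two-stage flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it (`NeverRule`, `ground_rules`, `first_violation`, `guard`, the semantic map, and the plan format) is a hypothetical choice made for illustration. A hard-coded function stands in for the root-of-trust LLM, the only specification form is the invariant G(!p), and the repair step keeps the longest safe prefix of the plan, which is a much cruder operation than temporal logic control synthesis over an automaton.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical plan representation: each step is an (action, argument) pair.
Step = Tuple[str, str]


@dataclass(frozen=True)
class NeverRule:
    """Illustrative invariant of the form G(!p): p must never hold."""
    name: str
    violates: Callable[[Step], bool]


def ground_rules(semantic_map: dict) -> List[NeverRule]:
    """Stand-in for the root-of-trust LLM (assumption: hard-coded logic).

    Grounds the abstract rule "stay out of hazardous areas" into one
    concrete constraint per region tagged "hazard" in the map.
    """
    rules = []
    for region, tags in semantic_map.items():
        if "hazard" in tags:
            rules.append(NeverRule(
                name=f"G(!goto({region}))",
                # Bind the region now so each rule checks its own region.
                violates=lambda step, r=region: step == ("goto", r),
            ))
    return rules


def first_violation(plan: Sequence[Step],
                    rules: Sequence[NeverRule]) -> Optional[Tuple[int, str]]:
    """Return (step index, rule name) for the first unsafe step, or None."""
    for i, step in enumerate(plan):
        for rule in rules:
            if rule.violates(step):
                return i, rule.name
    return None


def guard(plan: Sequence[Step], rules: Sequence[NeverRule]) -> List[Step]:
    """Return the plan unchanged if safe, else its longest safe prefix."""
    hit = first_violation(plan, rules)
    return list(plan) if hit is None else list(plan[:hit[0]])


semantic_map = {"lab": ["room"], "construction_zone": ["hazard"]}
rules = ground_rules(semantic_map)
plan = [("goto", "lab"), ("goto", "construction_zone"), ("inspect", "lab")]
print(guard(plan, rules))  # [('goto', 'lab')]
```

A safe plan passes through `guard` unchanged, which mirrors the goal of not degrading performance on safe tasks. Handling richer temporal properties (for example, ordering or response constraints) would require translating the specifications into automata rather than checking steps one at a time.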
