Table of Contents
Fetching ...

Joint Verification and Refinement of Language Models for Safety-Constrained Planning

Yunhao Yang, Neel P. Bhatt, William Ward, Zichao Hu, Joydeep Biswas, Ufuk Topcu

TL;DR

This work tackles the risk of LLM-generated robot programs violating safety specifications by introducing a formal verification pipeline that translates programs into finite-state automata and checks them against temporal logic safety properties. A key compositional verification theorem guarantees that any composition of individually verified subprograms remains safe, dramatically reducing verification complexity for long-horizon tasks. The authors accompany verification with a verification-guided fine-tuning procedure that uses verified subcomponents as supervision, achieving a 30% rise in specification-compliant programs and a ~50% reduction in training time. Empirical demonstrations on outdoor driving and indoor robots, along with out-of-domain validation and baselines, establish the practicality and efficiency of the approach for safer, scalable real-world robotic programming. The work lays groundwork for multimodal and human-in-the-loop extensions to further enhance safety in dynamic environments.

Abstract

Large language models possess impressive capabilities in generating programs (e.g., Python) from natural language descriptions to execute robotic tasks. However, these generated programs often contain errors that violate externally given task specifications. Without an effective method to verify their correctness, the reliable deployment of language models in real-world systems is practically infeasible. We develop a method that converts generated robot programs into an automaton-based representation and verifies them against task-relevant safety specifications. We establish a theorem that any arbitrary combination of the verified programs will also satisfy the safety specifications. Hence, the method eliminates the need to verify complex programs composed of multiple simpler ones, reducing computation complexity. We then introduce an automated fine-tuning procedure that leverages verification outcomes for supervision. By applying the theorem, this procedure only requires training the model to generate safe sub-components, thereby improving training efficiency. Empirical results on robot applications show a 30 percent increase in the probability of generating specification-compliant programs, with training time reduced by half compared to fine-tuning on generating full programs.

Joint Verification and Refinement of Language Models for Safety-Constrained Planning

TL;DR

This work tackles the risk of LLM-generated robot programs violating safety specifications by introducing a formal verification pipeline that translates programs into finite-state automata and checks them against temporal logic safety properties. A key compositional verification theorem guarantees that any composition of individually verified subprograms remains safe, dramatically reducing verification complexity for long-horizon tasks. The authors accompany verification with a verification-guided fine-tuning procedure that uses verified subcomponents as supervision, achieving a 30% rise in specification-compliant programs and a ~50% reduction in training time. Empirical demonstrations on outdoor driving and indoor robots, along with out-of-domain validation and baselines, establish the practicality and efficiency of the approach for safer, scalable real-world robotic programming. The work lays groundwork for multimodal and human-in-the-loop extensions to further enhance safety in dynamic environments.

Abstract

Large language models possess impressive capabilities in generating programs (e.g., Python) from natural language descriptions to execute robotic tasks. However, these generated programs often contain errors that violate externally given task specifications. Without an effective method to verify their correctness, the reliable deployment of language models in real-world systems is practically infeasible. We develop a method that converts generated robot programs into an automaton-based representation and verifies them against task-relevant safety specifications. We establish a theorem that any arbitrary combination of the verified programs will also satisfy the safety specifications. Hence, the method eliminates the need to verify complex programs composed of multiple simpler ones, reducing computation complexity. We then introduce an automated fine-tuning procedure that leverages verification outcomes for supervision. By applying the theorem, this procedure only requires training the model to generate safe sub-components, thereby improving training efficiency. Empirical results on robot applications show a 30 percent increase in the probability of generating specification-compliant programs, with training time reduced by half compared to fine-tuning on generating full programs.

Paper Structure

This paper contains 15 sections, 3 theorems, 1 equation, 11 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

Let $\phi$ describe safety property $P_{\mathrm{safe}}$, an automaton $\mathcal{P}$ satisfies $\phi$ (denoted as $\mathcal{P} \models \phi$) if and only if Traces($\mathcal{P}$) $\subseteq P_{\mathrm{safe}}$.

Figures (11)

  • Figure 1: A composed program of three subprograms. The states of each subprogram are connected in black transitions. The orange dashed transitions connect the subprograms. We establish a theorem that any combination of individually verified subprograms also satisfies the safety specifications.
  • Figure 2: In this pipeline, we verify each LLM-generated program and add the specification-compliant programs to a set of safety-constrained programs, which will be used for fine-tuning the LLM.
  • Figure 3: An example of a joint automaton $\mathcal{P}^* = (Q_1 \cup Q_2, Q_{0_1}, T_1 \cup T_2 \cup T^*, L_1 \cup L_2)$ of $\mathcal{P}_1$ and $\mathcal{P}_2$. We mark $\mathcal{P}_1$ and $\mathcal{P}_2$ in blue and purple, and mark the transition in $T^*$ in orange.
  • Figure 4: By Theorem \ref{['thm: safe']}, we can safely execute the composed program to solve complex tasks without further verification.
  • Figure 5: The three robots we used in the experiments. From left to right, we name them Jackal outdoor robot, Jackal indoor robot, and Spot robot dog.
  • ...and 6 more figures

Theorems & Definitions (13)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Proposition 1
  • Definition 7
  • Remark 1
  • Theorem 1
  • ...and 3 more