Multimodal Pretrained Models for Verifiable Sequential Decision-Making: Planning, Grounding, and Perception

Yunhao Yang; Cyrus Neary; Ufuk Topcu

Multimodal Pretrained Models for Verifiable Sequential Decision-Making: Planning, Grounding, and Perception

Yunhao Yang, Cyrus Neary, Ufuk Topcu

TL;DR

The paper tackles the challenge of turning multimodal pretrained model outputs into usable, verifiable sequential decision-making components. It introduces LLM2Automata to synthesize automaton-based controllers from text and Automata2Env to ground those controllers to perceptual observations via vision-language models, with formal guarantees under perceptual uncertainty. Verification is achieved by composing the controller with external knowledge models into a product automaton and applying model checking, while grounding integrates probabilistic reasoning about perception thresholds. Empirical demonstrations on a crossing-road scenario and a robot-arm manipulation task show that the approach can generate verifiable, perception-grounded controllers and quantify safety guarantees in the presence of uncertainty, illustrating practical impact for safer autonomous systems. The work advances formal verification for foundation-model-driven planning and grounding, enabling robust behavior in everyday and manipulation contexts and suggesting avenues for active perception and stronger safety guarantees.

Abstract

Recently developed pretrained models can encode rich world knowledge expressed in multiple modalities, such as text and images. However, the outputs of these models cannot be integrated into algorithms to solve sequential decision-making tasks. We develop an algorithm that utilizes the knowledge from pretrained models to construct and verify controllers for sequential decision-making tasks, and to ground these controllers to task environments through visual observations with formal guarantees. In particular, the algorithm queries a pretrained model with a user-provided, text-based task description and uses the model's output to construct an automaton-based controller that encodes the model's task-relevant knowledge. It allows formal verification of whether the knowledge encoded in the controller is consistent with other independently available knowledge, which may include abstract information on the environment or user-provided specifications. Next, the algorithm leverages the vision and language capabilities of pretrained models to link the observations from the task environment to the text-based control logic from the controller (e.g., actions and conditions that trigger the actions). We propose a mechanism to provide probabilistic guarantees on whether the controller satisfies the user-provided specifications under perceptual uncertainties. We demonstrate the algorithm's ability to construct, verify, and ground automaton-based controllers through a suite of real-world tasks, including daily life and robot manipulation tasks.

Multimodal Pretrained Models for Verifiable Sequential Decision-Making: Planning, Grounding, and Perception

TL;DR

Abstract

Paper Structure (31 sections, 2 theorems, 4 equations, 10 figures, 1 table, 1 algorithm)

This paper contains 31 sections, 2 theorems, 4 equations, 10 figures, 1 table, 1 algorithm.

Introduction
Related Work
Formal Representations of Textual Knowledge.
Multimodal Models in Sequential Decision-Making.
Multimodal Models Grounding and Perceptions.
Preliminaries
Multimodal Pretrained Models.
Finite State Automaton.
Controllers and Models.
The Planning Domain Definition Language.
Task Controller Construction
Verifying against External Knowledge
Verifiable Grounding
The Pipeline of Automata2Env
Modifying the Controller to Handle Uncertainties
...and 16 more sections

Key Result

Theorem 1

Let $\mathbf{p}_t$ be the proposition evaluation accuracy for the true threshold $t$ and $\mathbf{p}_f$ be the accuracy for the false threshold $f$. Then, for each proposition evaluation whose result is not uncertain,

Figures (10)

Figure 1: A real-world example that applies the proposed pipeline to a robot arm manipulation task.
Figure 2: Demonstration of using the LLM2Automata algorithm to construct the controller through queries to the .
Figure 3: Demonstration of grounding the -based controller to the real-world task environment through visual perceptions.
Figure 4: The left and right figures show the proposition evaluation accuracies under different false and true thresholds.
Figure 5: A demonstration Automata2Env implementing control logic under perceptual uncertainties. The figure shows a sequence of observations from the real-world environment, where red and green boxes with confidence scores above are the object detection results from the . We use the to measure the confidence of image content and evaluate the propositions from the controller. The resulting controller's state transitions are $q_1 \rightarrow q_2 \rightarrow q_2 \rightarrow q_3 \rightarrow q_3 \rightarrow q_4$.
...and 5 more figures

Theorems & Definitions (2)

Theorem 1
Theorem 2

Multimodal Pretrained Models for Verifiable Sequential Decision-Making: Planning, Grounding, and Perception

TL;DR

Abstract

Multimodal Pretrained Models for Verifiable Sequential Decision-Making: Planning, Grounding, and Perception

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (10)

Theorems & Definitions (2)