Pretrained Embeddings as a Behavior Specification Mechanism

Parv Kapoor, Abigail Hammer, Ashish Kapoor, Karen Leung, Eunsuk Kang

TL;DR

This work addresses the challenge of formally specifying behaviors for AI-enabled systems that rely on perception by introducing embeddings as first-class objects in a specification language. It proposes Embedding Temporal Logic (ETL), allowing properties to be defined via distances between target and observed embeddings, and integrates pretrained vision models and world models to enable planning with embedding-based specifications. The paper defines ETL syntax, semantics, and quantitative satisfaction, and demonstrates through examples and preliminary experiments in navigation and manipulation that ETL-guided planning can steer systems toward desirable behaviors. The findings suggest embedding-based specifications broaden the scope of verifiable properties for AI systems and highlight practical considerations for distance metrics, target embedding specification, and future avenues in monitoring, verification, and explainability.

Abstract

We propose an approach to formally specifying the behavioral properties of systems that rely on a perception model for interactions with the physical world. The key idea is to introduce embeddings -- mathematical representations of a real-world concept -- as a first-class construct in a specification language, where properties are expressed in terms of distances between a pair of ideal and observed embeddings. To realize this approach, we propose a new type of temporal logic called Embedding Temporal Logic (ETL), and describe how it can be used to express a wider range of properties about AI-enabled systems than previously possible. We demonstrate the applicability of ETL through a preliminary evaluation involving planning tasks in robots that are driven by foundation models; the results are promising, showing that embedding-based specifications can be used to steer a system towards desirable behaviors.
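The idea of scoring a property by the distance between an ideal and an observed embedding can be illustrated with a small sketch. It is not the paper's definition of ETL: the threshold-minus-distance score and the max/min treatment of "eventually" and "always" are assumptions borrowed from the usual quantitative semantics of signal temporal logic. The function names and the 2-D toy vectors are hypothetical stand-ins for real encoder outputs.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predicate_score(
    observed: Vector,
    target: Vector,
    threshold: float,
    dist: Callable[[Vector, Vector], float] = l2_distance,
) -> float:
    """Assumed score: positive iff `observed` lies within `threshold` of `target`."""
    return threshold - dist(observed, target)


def eventually(scores: List[float]) -> float:
    """Assumed semantics: satisfied if the predicate holds at some step."""
    return max(scores)


def always(scores: List[float]) -> float:
    """Assumed semantics: satisfied only if the predicate holds at every step."""
    return min(scores)


# Hypothetical toy data: a target embedding and a trace of observed embeddings.
target = [1.0, 0.0]
trace = [[0.0, 0.0], [0.5, 0.0], [0.9, 0.0]]

scores = [predicate_score(z, target, threshold=0.25) for z in trace]
reach_goal = eventually(scores)   # positive: the last observation is close enough
stay_at_goal = always(scores)     # negative: the first observation is too far away
```

Under these assumptions, a planner could rank candidate action sequences by such a score and prefer those whose predicted trace scores positive.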

Paper Structure

This paper contains 27 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Image-to-image embedding distances. We compute distance metrics between the embeddings of images captured within a Habitat scene. Each image is processed by a pretrained encoder, from which we extract the corresponding embeddings. We then calculate the pairwise distances between these embeddings using different distance metrics.
  • Figure 2: Overview of planning with ETL specifications. We utilize Habitat for navigation and NVIDIA FLeX for granular manipulation. Pretrained encoders generate embeddings from goal and current observations, which are used to evaluate specification satisfaction. The planner integrates these embeddings with a world model to generate ETL-satisfying actions.
  • Figure 3: Image-to-image embedding distances. We compute distance metrics between the embeddings of images captured within a Habitat scene. Each image is processed by a pretrained encoder, from which we extract the corresponding embeddings. We then calculate the pairwise distances between these embeddings using different distance metrics.
  • Figure 4: Text-image embedding distances. We demonstrate how models like OpenCLIP can generate textual embeddings that can be compared with image embeddings. This enables the capture of arbitrary requirements through text while serving as a reliable specification mechanism for behavior.
  • Figure 5: Ending states after planning with a pretrained PACT world model in the Habitat environment, using L2 distance as the base metric. We present the start state as well as the goal images used for $\varphi_1$, $\varphi_2$, and $\varphi_3$ in different scenes. The planner successfully achieves positive satisfaction scores for all ETL specifications.
  • ...and 1 more figures
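
The pairwise distance computation described in the image-to-image captions can be sketched as follows. This is a minimal illustration rather than the authors' code. The vectors are hypothetical stand-ins for encoder outputs. L2 is the metric named in the Figure 5 caption, while cosine distance is an assumed second metric added only to show how the choice of metric changes the result.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def l2(a: Vector, b: Vector) -> float:
    """Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Vector, b: Vector) -> float:
    """One minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pairwise(
    embeddings: List[Vector], dist: Callable[[Vector, Vector], float]
) -> List[List[float]]:
    """Full distance matrix: entry [i][j] is dist(embeddings[i], embeddings[j])."""
    return [[dist(a, b) for b in embeddings] for a in embeddings]


# Hypothetical embeddings: the third points the same way as the first but is longer.
embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]

l2_matrix = pairwise(embeddings, l2)
cos_matrix = pairwise(embeddings, cosine_distance)
```

In this toy data the first and third vectors are one unit apart under L2 but have zero cosine distance, because cosine distance ignores magnitude. Differences of this kind are one reason the choice of metric matters when distances are used in a specification.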