Temporal and Semantic Evaluation Metrics for Foundation Models in Post-Hoc Analysis of Robotic Sub-tasks

Jonathan Salfity, Selma Wanna, Minkyu Choi, Mitch Pryor

TL;DR

The paper tackles the scarcity of language-annotated robot trajectories in Task and Motion Planning (TAMP) by introducing an automated framework that uses Foundation Models (FMs) to decompose trajectories post hoc into temporally bounded sub-tasks with natural language descriptions. It contributes SIMILARITY, an algorithm that produces two metrics, temporal similarity and semantic similarity, for comparing an FM-generated decomposition against a ground-truth decomposition, and it details a prompt-driven methodology for producing accurate sub-task labels without fine-tuning. Experiments in Robosuite across multiple environments report temporal and semantic similarity scores above 90%, compared to 30% for a randomized baseline, when the prompt includes an in-context example and textual data, with notable cost and time savings compared to human labeling. The work enables scalable creation of language-supervised robotic datasets, potentially accelerating progress in TAMP and Embodied AI applications.

Abstract

Recent works in Task and Motion Planning (TAMP) show that training control policies on language-supervised robot trajectories with quality labeled data markedly improves agent task success rates. However, the scarcity of such data presents a significant hurdle to extending these methods to general use cases. To address this concern, we present an automated framework to decompose trajectory data into temporally bounded and natural language-based descriptive sub-tasks by leveraging recent prompting strategies for Foundation Models (FMs), including both Large Language Models (LLMs) and Vision Language Models (VLMs). Our framework provides both time-based and language-based descriptions for the lower-level sub-tasks that comprise full trajectories. To rigorously evaluate the quality of our automatic labeling framework, we contribute an algorithm, SIMILARITY, that produces two novel metrics: temporal similarity and semantic similarity. The metrics measure the temporal alignment and semantic fidelity of language descriptions between two sub-task decompositions, namely an FM sub-task decomposition prediction and a ground-truth sub-task decomposition. We present scores for temporal similarity and semantic similarity above 90%, compared to 30% for a randomized baseline, for multiple robotic environments, demonstrating the effectiveness of our proposed framework. Our results enable building diverse, large-scale, language-supervised datasets for improved robotic TAMP.
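
To make the idea of temporal alignment between two sub-task decompositions concrete, the sketch below scores a predicted decomposition against a ground-truth one with interval intersection-over-union (IoU). This is an illustrative stand-in, not the paper's SIMILARITY algorithm, whose details are not given in this summary. The names `SubTask`, `interval_iou`, and `mean_temporal_iou` are hypothetical, and representing a sub-task as a start time, an end time, and a description is an assumption based on the phrase "temporally bounded and natural language-based descriptive sub-tasks".

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SubTask:
    """Hypothetical sub-task record: a time interval plus a text label."""
    start: float  # start time step (inclusive)
    end: float    # end time step (exclusive)
    description: str


def interval_iou(a: SubTask, b: SubTask) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def mean_temporal_iou(predicted: Sequence[SubTask],
                      ground_truth: Sequence[SubTask]) -> float:
    """Average, over ground-truth sub-tasks, of the best IoU with any prediction.

    Assumed scoring rule for illustration only; it is NOT the paper's metric.
    """
    if not predicted or not ground_truth:
        return 0.0
    best = [max(interval_iou(g, p) for p in predicted) for g in ground_truth]
    return sum(best) / len(best)


# Made-up example data, not taken from the paper.
ground_truth = [SubTask(0, 10, "reach for the cube"),
                SubTask(10, 20, "grasp the cube")]
predicted = [SubTask(0, 10, "move to the cube"),
             SubTask(10, 15, "close the gripper")]
score = mean_temporal_iou(predicted, ground_truth)
```

In this toy example the first ground-truth interval is matched exactly and the second is half covered, so the score lands between the two. A semantic counterpart would compare the `description` strings of matched sub-tasks, for example with a text-embedding model, which is left out here because the summary does not say how the paper computes it.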

Paper Structure (15 sections, 9 equations, 3 figures, 3 tables, 1 algorithm)

Figures (3)

  • Figure 1: Our approach evaluates a Foundation Model's (FM) ability to temporally and semantically decompose a robot trajectory into a sub-task decomposition. We compare an FM sub-task decomposition, $\hat{\mathcal{S}}$, with a ground-truth sub-task decomposition, $\mathcal{S}$, through our core contributions of temporal and semantic alignment metrics. The image above shows how robot trajectory data, $\mathcal{D}$, is post-hoc processed by an FM to compute a predicted sub-task decomposition, $\hat{\mathcal{S}}$, and quantitatively compared to a ground-truth sub-task decomposition, $\mathcal{S}$.
  • Figure 2: A Prompt, $\textbf{P}$, with context, $\textbf{C}$, and Trajectory Data, $\mathcal{D}$, as described in \ref{ssec:prompt_engineer}. The key sections, in blue bold text, represent the different components of $\textbf{P}$: the FM tasking, $\textbf{C}^{task}$; a partial in-context example, $\textbf{C}^{1S}$; textual data, $\mathcal{D}^{k x u}$; and visual data, $\mathcal{D}^{\mu}$.
  • Figure 3: Ablation study for two environments: temporal, $\tau_{k}$, and semantic, $\tau_{\zeta}$, statistics for different FMs with varying inputs, including the context, $\textbf{C}$, and the data modalities (textual data, $\mathcal{D}^{k x u}$, and visual data, $\mathcal{D}^{\mu}$). The key insight is that an in-context example, $\textbf{C}^{1S}$ (green), significantly boosts both $\tau_{k}$ and $\tau_{\zeta}$, while visual data often decreases $\tau_{\zeta}$. By applying our metrics across different FMs and prompts $\textbf{P}$, developers can choose the best FM configuration for their application. In these examples, GPT-4V with $\textbf{C}^{1S}$ and $\mathcal{D}^{k x u}$ performs best and most consistently.

Theorems & Definitions (3)

  • Definition 1: Trajectory Data
  • Definition 2: Sub-task
  • Definition 3: Sub-task Decomposition