ExTraCT -- Explainable Trajectory Corrections from language inputs using Textual description of features

J-Anne Yow; Neha Priyadarshini Garg; Manoj Ramanathan; Wei Tech Ang

TL;DR

ExTraCT modifies robot trajectories from natural-language corrections by grounding each correction in a finite set of interpretable trajectory modification features, using textual descriptions of the features and semantic similarity. It separates language understanding from trajectory deformation: the most likely feature is selected as $\phi^* = \arg\max_{\phi \in \Phi} P(\phi \mid l)$, the deformed trajectory is computed as $\xi^* = \delta(\phi^*, \xi_0, E)$, and a subsequent trajectory optimizer enforces the robot's kinematic constraints. Features are either scene-specific or scene-independent. Each feature is described by a set of template phrases $T_\phi$, which are embedded to compute $P(T_\phi \mid l) \propto \max_{t_\phi \in T_\phi} \frac{q(t_\phi)^\top q(l)}{\lVert q(t_\phi) \rVert \, \lVert q(l) \rVert}$. In simulation and real-robot experiments, ExTraCT achieves higher accuracy and stronger user preference than an end-to-end baseline, while offering better interpretability and generalization to unseen trajectories and object configurations. The system is also demonstrated on a manipulation task and an assistive feeding task.
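The feature-selection rule in the TL;DR (cosine similarity, maximum over a feature's template phrases, then arg max over features) can be illustrated with a minimal sketch. It assumes a toy bag-of-words embedding in place of the LLM embedding $q(\cdot)$, and the feature names and template phrases are hypothetical, not taken from the paper.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for q(.): a bag-of-words count vector.
    Assumption: the paper uses LLM embeddings, not word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_feature(correction, templates):
    """Score each feature by its best-matching phrase, then take the arg max."""
    q_l = embed(correction)
    scores = {
        feature: max(cosine(embed(t), q_l) for t in phrases)
        for feature, phrases in templates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical features and template phrases, for illustration only.
TEMPLATES = {
    "cup_distance_increase": ["stay further away from the cup",
                              "keep a bigger distance from the cup"],
    "cup_distance_decrease": ["move closer to the cup",
                              "go nearer to the cup"],
    "z_increase": ["move up", "go higher"],
}

best, scores = select_feature("stay further from the cup", TEMPLATES)
print(best)  # cup_distance_increase
```

Word counts only capture lexical overlap, so this sketch would fail on paraphrases that share no words with any template; a semantic embedding model is what makes the matching robust in practice.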

Abstract

Natural language provides an intuitive and expressive way of conveying human intent to robots. Prior works employed end-to-end methods for learning trajectory deformations from language corrections. However, such methods do not generalize to new initial trajectories or object configurations. This work presents ExTraCT, a modular framework for trajectory corrections using natural language that combines Large Language Models (LLMs) for natural language understanding and trajectory deformation functions. Given a scene, ExTraCT generates the trajectory modification features (scene-specific and scene-independent) and their corresponding natural language textual descriptions for the objects in the scene online based on a template. We use LLMs for semantic matching of user utterances to the textual descriptions of features. Based on the feature matched, a trajectory modification function is applied to the initial trajectory, allowing generalization to unseen trajectories and object configurations. Through user studies conducted both in simulation and with a physical robot arm, we demonstrate that trajectories deformed using our method were more accurate and were preferred in about 80% of cases, outperforming the baseline. We also showcase the versatility of our system in a manipulation task and an assistive feeding task.
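The online, template-based generation of features and textual descriptions described in the abstract might look roughly like the sketch below. All feature names, template phrases, and the `generate_features` helper are hypothetical; the paper's actual templates are not reproduced in this summary.

```python
# Hypothetical scene-independent features: they do not refer to any object.
SCENE_INDEPENDENT = {
    "z_increase": ["move up", "go higher"],
    "z_decrease": ["move down", "go lower"],
}

# Hypothetical scene-specific templates: instantiated once per detected object.
OBJECT_TEMPLATES = {
    "distance_increase": ["stay further away from the {obj}",
                          "keep away from the {obj}"],
    "distance_decrease": ["move closer to the {obj}",
                          "stay near the {obj}"],
}

def generate_features(objects):
    """Build the feature -> textual-description map for the current scene."""
    features = {name: list(phrases) for name, phrases in SCENE_INDEPENDENT.items()}
    for obj in objects:
        for change, phrases in OBJECT_TEMPLATES.items():
            features[f"{obj}_{change}"] = [p.format(obj=obj) for p in phrases]
    return features

features = generate_features(["cup", "laptop"])
for name, phrases in features.items():
    print(name, "->", phrases)
```

Because the descriptions are produced from the object list at run time, a new object in the scene adds features without any retraining, which is the property the abstract credits for generalization to unseen object configurations.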

Paper Structure (26 sections, 4 equations, 7 figures, 6 tables)

Introduction
Related Work
Semantic Parsing to Probabilistic Graphs
End-to-End Learning Using Embeddings
Prompting LLMs to Generate Code
Approach
Problem Definition
Features
Textual Descriptions and Optimal Feature Selection
Deformation Function
Experiments
Baseline
Generalization Experiments
Evaluation Metrics
Results
...and 11 more sections

Figures (7)

Figure 1: Architecture of ExTraCT. Given the objects in a scene, the features $\phi$ and corresponding textual descriptions $T_{\phi}$ are generated online. We obtain the embeddings of the language correction $q(l)$ and the phrases ($t_{\phi} \in T_{\phi}$) in the textual descriptions of the features $q(t_{\phi})$, and use semantic textual similarity to obtain the most similar textual description, which is mapped to feature $\phi^{*}$. A deformation function $\delta$ is used to deform the initial trajectory $\xi_{0}$ based on the feature $\phi^{*}$ and the object positions in the environment $E$. A trajectory optimizer is used to ensure that the robot's kinematic constraints are satisfied.
Figure 2: Accuracy evaluation. The green trajectory shows an incorrect deformation, while the blue trajectory shows a correct deformation. (a) Cartesian changes -- we sampled trajectory deformations with varying weights, which affect the intensity of deformation. The sampled trajectories below the original trajectory have negative weights, while those above the original trajectory have positive weights. (b) Object distance changes -- we obtained the waypoints in the original and deformed trajectories and compared the change in distance relative to the target object.
Figure 3: Changes in the deformed trajectory using (a) a sample in LaTTe's dataset, (b) a change in the target object pose, (c) a change in the language correction that conveys an opposite meaning, and (d) a change in the initial trajectory. The trajectory deformed by LaTTe is inaccurate for (b), (c), and (d), while ExTraCT produces a correct trajectory deformation in all cases.
Figure 4: Interface for the simulated study showing scene 1. The modified trajectories are displayed on the interface simultaneously for better comparison. The green modified trajectory is by LaTTe, while the blue modified trajectory is by our approach.
Figure 5: The deformed trajectories and features matched for (a) scene 2 and (b) scene 3. Note that the modified trajectory by LaTTe opposes the corrections provided in these examples.
...and 2 more figures
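
The deformation step in Figure 1 and the object-distance check in Figure 2(b) can be sketched together. This is a minimal illustration under stated assumptions, not the paper's $\delta$: the function `deform_away_from`, the sine-shaped bump, and the mean-distance criterion are all hypothetical choices, and no trajectory optimizer is applied afterwards.

```python
import math

def deform_away_from(trajectory, obj_pos, weight):
    """Hypothetical deformation: shift interior waypoints along the direction
    pointing away from the object. A positive weight increases the distance,
    a negative weight decreases it. Start and goal waypoints stay fixed."""
    n = len(trajectory)
    deformed = [trajectory[0]]
    for i in range(1, n - 1):
        p = trajectory[i]
        d = math.dist(p, obj_pos)
        if d == 0:
            deformed.append(p)  # direction undefined, leave waypoint as is
            continue
        bump = math.sin(math.pi * i / (n - 1))  # 0 at the ends, 1 in the middle
        scale = weight * bump / d
        deformed.append(tuple(pc + scale * (pc - oc) for pc, oc in zip(p, obj_pos)))
    deformed.append(trajectory[-1])
    return deformed

def mean_distance(trajectory, obj_pos):
    """Average waypoint distance to the target object."""
    return sum(math.dist(p, obj_pos) for p in trajectory) / len(trajectory)

def distance_change_is_correct(original, deformed, obj_pos, expect_increase):
    """Compare waypoint distances before and after deformation (assumed criterion)."""
    delta = mean_distance(deformed, obj_pos) - mean_distance(original, obj_pos)
    return delta > 0 if expect_increase else delta < 0

# Straight-line trajectory passing close to a hypothetical object.
original = [(i / 10, 0.0, 0.2) for i in range(11)]
obj = (0.5, 0.1, 0.2)

pushed = deform_away_from(original, obj, weight=0.1)
pulled = deform_away_from(original, obj, weight=-0.05)

print(distance_change_is_correct(original, pushed, obj, expect_increase=True))
print(distance_change_is_correct(original, pulled, obj, expect_increase=False))
```

The `weight` argument plays the role of the deformation intensity mentioned in Figure 2(a). A negative weight whose magnitude exceeds a waypoint's distance to the object would overshoot it, so a real deformation function would need to clamp or otherwise bound the shift.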
