Grounding Language Plans in Demonstrations Through Counterfactual Perturbations

Yanwei Wang, Tsun-Hsuan Wang, Jiayuan Mao, Michael Hagenow, Julie Shah

TL;DR

The paper addresses grounding common-sense language plans from LLMs into embodied robot behavior by introducing GLiDE, which represents task structure as discrete mode families and grounds language-based plans through end-to-end learning. It augments a small set of demonstrations with counterfactual perturbations, prompts LLMs to extract a $K$-mode plan and a feasibility matrix $F^K$, and trains a differentiable mode classifier $\phi(s)$ along with per-mode policies, optionally using a pseudo-attractor to improve robustness. Key contributions include demonstration augmentation with counterfactuals, explanation-based learning to recover mode boundaries, and a grounding operator that enables interpretable replanning and reactive policies across 2D navigation, Robosuite tasks, and real-robot experiments. The approach enhances interpretability and reactivity in imitation learning and offers a pathway to grounding semantic language in physical action spaces with limited labeled data, potentially improving robustness under perturbations in embodied AI.

Abstract

Grounding the common-sense reasoning of Large Language Models (LLMs) in physical domains remains a pivotal yet unsolved problem for embodied AI. Whereas prior works have focused on leveraging LLMs directly for planning in symbolic spaces, this work uses LLMs to guide the search of task structures and constraints implicit in multi-step demonstrations. Specifically, we borrow from manipulation planning literature the concept of mode families, which group robot configurations by specific motion constraints, to serve as an abstraction layer between the high-level language representations of an LLM and the low-level physical trajectories of a robot. By replaying a few human demonstrations with synthetic perturbations, we generate coverage over the demonstrations' state space with additional successful executions as well as counterfactuals that fail the task. Our explanation-based learning framework trains an end-to-end differentiable neural network to predict successful trajectories from failures and as a by-product learns classifiers that ground low-level states and images in mode families without dense labeling. The learned grounding classifiers can further be used to translate language plans into reactive policies in the physical domain in an interpretable manner. We show our approach improves the interpretability and reactivity of imitation learning through 2D navigation and simulated and real robot manipulation tasks. Website: https://yanweiw.github.io/glide
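The augmentation step described in the abstract (replaying demonstrations with synthetic perturbations to obtain both extra successes and failing counterfactuals) can be illustrated with a toy 2D sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the waypoint representation, the single-step displacement in `perturb_replay`, and the rectangular "gate" used by the hypothetical success check `succeeds`.

```python
import random


def perturb_replay(demo, rng, scale=1.0):
    """Replay `demo` (a list of (x, y) waypoints) with one synthetic perturbation.

    A single interior waypoint is displaced by a random offset; the rest of
    the demonstration is replayed unchanged. (Illustrative assumption.)
    """
    k = rng.randrange(1, len(demo) - 1)  # pick an interior step
    dx = rng.uniform(-scale, scale)
    dy = rng.uniform(-scale, scale)
    x, y = demo[k]
    return demo[:k] + [(x + dx, y + dy)] + demo[k + 1:]


def succeeds(traj, gate=((1.5, 2.5), (-0.5, 0.5))):
    """Hypothetical task check: some state must lie inside a rectangular gate."""
    (x0, x1), (y0, y1) = gate
    return any(x0 <= x <= x1 and y0 <= y <= y1 for x, y in traj)


def augment(demo, n, seed=0, scale=1.0):
    """Return n (trajectory, success_label) pairs from perturbed replays."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        traj = perturb_replay(demo, rng, scale)
        data.append((traj, succeeds(traj)))
    return data


demo = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
dataset = augment(demo, n=20)
```

Only trajectory-level success labels are produced, which mirrors the abstract's point that no dense per-state mode labels are needed; in this toy setup, replays whose perturbation pushes the state out of the gate play the role of counterfactual failures.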

Paper Structure (20 sections, 3 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: GLiDE framework. Given a common-sense LLM that understands (a) the appropriate state abstractions for a task and (b) how to solve the task via a sequence of manipulation modes in semantic space, together with (c) a few unsegmented human demonstrations that embody the transitions through these modes, we learn a grounding classifier that maps continuous physical states and observations to discrete semantic modes. Mode boundaries discovered by the classifier encode constraints implicit in the demonstrations that are critical for task success.
  • Figure 2: (a-c) Example perturbations causing replays (blue) to deviate from successful demonstrations (red). The task is to pick up the square nut and place it on the peg. End-effector perturbations at different locations (a) may or (b) may not cause grasp failures. (c) The gripper picks up the nut despite an initial end-effector perturbation but later drops it due to a gripper perturbation. LLMs can be prompted (d) to describe a task solution via a discrete mode sequence or (e) to select relevant features and pseudo attractors for solving a task.
  • Figure 3: (a) Example feasibility matrices. Specifically, $F^3$ can describe the modal structure for a pick-and-place task with solution reach$\rightarrow$grasp$\rightarrow$transport, where reach$\rightarrow$transport directly is infeasible. (b) The definition of a mode transition implies every state in the second mode is reachable from every state in the first mode (states Y and Z are in the same mode but not X). We leverage this connection between the continuous states and the discrete modes to design (c) a fully-differentiable pipeline that calculates overall trajectory success based on the mode classification of individual states in the trajectory.
  • Figure 4: Grounding of the 2D navigation task. (a) Given six demonstrations that start in mode 1 and end in mode 5, visualized on top of the ground truth, (b) our method GLiDE recovers the underlying mode abstractions. (c) Without counterfactual data, GLiDE fails to learn precise boundaries. (d) Without a correct feasibility matrix (e.g., a 4-mode matrix instead of a 5-mode one), GLiDE misses modes. (e) Lastly, clustering the 2D state space to the nearest mode centers, discovered in the demonstrations by k-means++, produces an incorrect modal structure.
  • Figure 5: (a) Illustration of the real robot 2D navigation task, where the end-effector traces through a sequence of colored polygons. (b) A perturbed trajectory, overlaid on ground-truth mode boundaries, experiences an invalid transition from mode 1 to mode 4. A vision-based classifier can predict, from pixels alone, inferred modes (first color bar) that match the ground truth (second color bar) with high probability. (c) Mode predictions for individual image states seen in the dataset. The location of each scattered dot indicates where the image was recorded, while its color shows the prediction; the predictions are well aligned with the mode boundaries.
  • ...and 5 more figures
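
The feasibility matrix and trajectory-success computation described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the caption only states that reach→transport is infeasible, so marking every other transition as feasible in `F3` is an assumption, and the scoring rule in `trajectory_success` (start in the first mode, end in the last, product of expected per-step feasibilities) is one plausible differentiable form chosen for illustration.

```python
from math import prod

MODES = ["reach", "grasp", "transport"]

# Assumed 3-mode feasibility matrix: F3[i][j] = 1.0 if a direct transition
# from mode i to mode j is allowed. Only reach -> transport is marked
# infeasible, as stated in the Figure 3 caption; other entries are assumptions.
F3 = [
    [1.0, 1.0, 0.0],  # reach     -> reach, grasp, (not transport)
    [1.0, 1.0, 1.0],  # grasp     -> any mode
    [1.0, 1.0, 1.0],  # transport -> any mode
]


def transition_feasibility(p_now, p_next, F):
    """Expected feasibility of one step: sum_ij p_now[i] * F[i][j] * p_next[j]."""
    n = len(F)
    return sum(p_now[i] * F[i][j] * p_next[j] for i in range(n) for j in range(n))


def trajectory_success(mode_probs, F):
    """Soft success score from per-state mode probabilities.

    mode_probs[t] is a probability vector over modes for state t, e.g. the
    softmax output of a mode classifier (hypothetical here).
    """
    start = mode_probs[0][0]    # probability the first state is in the first mode
    end = mode_probs[-1][-1]    # probability the last state is in the last mode
    steps = prod(
        transition_feasibility(a, b, F)
        for a, b in zip(mode_probs, mode_probs[1:])
    )
    return start * end * steps


def one_hot(k, n=3):
    """Hard mode assignment as a probability vector."""
    return [1.0 if i == k else 0.0 for i in range(n)]


valid = trajectory_success([one_hot(0), one_hot(1), one_hot(2)], F3)
skipping = trajectory_success([one_hot(0), one_hot(2)], F3)
soft = trajectory_success(
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]], F3
)
```

With hard assignments, the reach→grasp→transport sequence scores 1.0 and the sequence that jumps from reach directly to transport scores 0.0. With soft classifier outputs the score lies strictly between the two, and because it is built only from sums and products of the probabilities, a trajectory-level success label could in principle be used to train the per-state classifier.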