Learning Compositional Behaviors from Demonstration and Language

Weiyu Liu, Neil Nie, Ruohan Zhang, Jiayuan Mao, Jiajun Wu

TL;DR

BLADE addresses long-horizon robotic manipulation by learning language-grounded abstract actions from demonstrations and grounding them in perception, then planning with learned predicates in an abstract state space. It combines Behavior Description Learning with automatic predicate annotation and diffusion-based low-level policies, enabling bi-level planning that composes short-horizon skills for novel goals and perturbations. The approach demonstrates strong generalization in the CALVIN simulation and real-world tasks, outperforming latent planning and LLM/VLM baselines, and shows the value of automatic predicate grounding and explicit geometric/visibility preconditions. This yields more robust, scalable long-horizon planning for robot manipulation with limited labeled data and language guidance, bridging language, perception, and control.

Abstract

We introduce Behavior from Language and Demonstration (BLADE), a framework for long-horizon robotic manipulation by integrating imitation learning and model-based planning. BLADE leverages language-annotated demonstrations, extracts abstract action knowledge from large language models (LLMs), and constructs a library of structured, high-level action representations. These representations include preconditions and effects grounded in visual perception for each high-level action, along with corresponding controllers implemented as neural network-based policies. BLADE can recover such structured representations automatically, without manually labeled states or symbolic definitions. BLADE shows significant capabilities in generalizing to novel situations, including novel initial states, external state perturbations, and novel goals. We validate the effectiveness of our approach both in simulation and on real robots with a diverse set of objects with articulated parts, partial observability, and geometric constraints.
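To make the idea of a structured action library concrete, the following is a minimal sketch, not BLADE's implementation. It assumes that each high-level action can be written as an operator with symbolic preconditions and effects, and that a plan can be found by breadth-first search over abstract states. In BLADE the predicates are grounded in visual perception and the controllers are neural network policies; here the predicates are plain strings, the controllers are omitted, and every name (`Operator`, `plan`, `OPERATORS`, the predicate and action labels) is hypothetical, loosely modeled on the kettle and faucet scenario described in the figure captions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """Hypothetical abstract action: symbolic preconditions and effects."""
    name: str
    preconditions: frozenset  # predicates that must hold before execution
    add_effects: frozenset    # predicates that become true afterwards
    del_effects: frozenset    # predicates that become false afterwards

    def applicable(self, state: frozenset) -> bool:
        return self.preconditions <= state

    def apply(self, state: frozenset) -> frozenset:
        return (state - self.del_effects) | self.add_effects


def plan(state, goal, operators):
    """Breadth-first search over abstract states.

    Returns a list of operator names reaching the goal, or None if the
    goal cannot be reached with the given operators.
    """
    start, goal = frozenset(state), frozenset(goal)
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        current, steps = frontier.popleft()
        if goal <= current:
            return steps
        for op in operators:
            if op.applicable(current):
                nxt = op.apply(current)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, steps + [op.name]))
    return None


# Assumed toy domain: the faucet must be turned away before the kettle
# can be placed in the sink (predicate names are illustrative only).
OPERATORS = [
    Operator("turn-faucet-away",
             frozenset({"hand-empty"}),
             frozenset({"faucet-away"}),
             frozenset()),
    Operator("pick-kettle",
             frozenset({"hand-empty"}),
             frozenset({"holding-kettle"}),
             frozenset({"hand-empty"})),
    Operator("place-kettle-in-sink",
             frozenset({"holding-kettle", "faucet-away"}),
             frozenset({"kettle-in-sink", "hand-empty"}),
             frozenset({"holding-kettle"})),
]

if __name__ == "__main__":
    print(plan({"hand-empty"}, {"kettle-in-sink"}, OPERATORS))
```

Calling `plan` again from whatever abstract state is observed after a disturbance yields a new action sequence, which is one simple way to picture recovery from external perturbations. The search here is deliberately naive; it only illustrates how short-horizon skills can be composed through their preconditions and effects.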

Paper Structure

This paper contains 29 sections, 12 figures, 5 tables, 2 algorithms.

Figures (12)

  • Figure 1: BLADE, a robot manipulation framework combining imitation learning and model-based planning. (a) BLADE takes language-annotated demonstrations as training data. (b) It generalizes to unseen initial conditions, state perturbations, and geometric constraints. (c) In the depicted scenarios, BLADE recovers from perturbations such as moving the kettle out of the sink, and resolves geometric constraints including a blocked stove.
  • Figure 2: Overview of BLADE. (a) BLADE receives language-annotated human demonstrations, (b) segments demonstrations into contact primitives, and learns a structured behavior representation. (c) It generalizes to novel conditions by leveraging bi-level planning and execution to achieve goal states.
  • Figure 3: Behavior Descriptions Learning. (a) A demonstration is provided along with corresponding language annotations. (b) The demonstration is segmented into a sequence of contact primitives. (c) A large language model interprets the annotation and contact sequence, generating a symbolic behavior definition. (d) The system automatically generates data to learn classifiers for state predicates.
  • Figure 4: Generalization Tasks in CALVIN. Examples from the three generalization tasks in the CALVIN simulation environment. Successfully completing these tasks requires planning for and executing 3-7 actions.
  • Figure 5: Domains and Results in Real World. Make Tea features a toy kitchen designed to simulate boiling water on a stove. The robot must assess the available space on the stove for the kettle. It also needs to manage the dependencies between actions: for example, the faucet must be turned away before the kettle can be placed into the sink, to avoid collisions. Boil Water involves a tabletop task aimed at preparing tea, incorporating a cabinet, a drawer, and a stove. The robot must locate the kettle, potentially hidden within the cabinet, and a teabag in the drawer. Additionally, it must consider geometric constraints by removing obstacles that block the cabinet doors. In both environments, our model significantly outperforms the VLM-based planner Robot-VILA.
  • ...and 7 more figures