Learning Compositional Behaviors from Demonstration and Language
Weiyu Liu, Neil Nie, Ruohan Zhang, Jiayuan Mao, Jiajun Wu
TL;DR
BLADE addresses long-horizon robotic manipulation by learning language-grounded abstract actions from demonstrations, grounding them in perception, and planning with the learned predicates in an abstract state space. It combines Behavior Description Learning with automatic predicate annotation and diffusion-based low-level policies, which enables bi-level planning that composes short-horizon skills to handle novel goals and perturbations. The approach generalizes well in the CALVIN simulation and on real-world tasks, outperforming latent planning and LLM/VLM baselines, and the results show the value of automatic predicate grounding and of explicit geometric and visibility preconditions. The result is more robust long-horizon planning for robot manipulation from language-annotated demonstrations, without manually labeled states.
Abstract
We introduce Behavior from Language and Demonstration (BLADE), a framework for long-horizon robotic manipulation that integrates imitation learning and model-based planning. BLADE leverages language-annotated demonstrations, extracts abstract action knowledge from large language models (LLMs), and constructs a library of structured, high-level action representations. For each high-level action, these representations include preconditions and effects grounded in visual perception, along with a corresponding controller implemented as a neural network-based policy. BLADE can recover such structured representations automatically, without manually labeled states or symbolic definitions. BLADE generalizes to novel situations, including novel initial states, external state perturbations, and novel goals. We validate the effectiveness of our approach both in simulation and on real robots, using a diverse set of objects with articulated parts, partial observability, and geometric constraints.
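To make the idea of a library of high-level actions with preconditions and effects concrete, the following is a minimal sketch of symbolic planning over such a library. It is not BLADE's implementation: the predicate names, the four drawer actions, the `AbstractAction` class, and the breadth-first `plan` function are all hypothetical and chosen only for illustration. In BLADE the predicates are grounded in visual perception and each action has a learned neural controller; here predicates are plain strings and the controller is left as an empty placeholder.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional

# An abstract state is the set of predicates that currently hold (hypothetical encoding).
State = FrozenSet[str]


@dataclass(frozen=True)
class AbstractAction:
    """A high-level action with symbolic preconditions and effects."""
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    del_effects: FrozenSet[str]
    # Placeholder for a low-level policy; unused in this sketch.
    controller: Optional[Callable[[], None]] = None

    def applicable(self, state: State) -> bool:
        # The action may run only if every precondition holds.
        return self.preconditions <= state

    def apply(self, state: State) -> State:
        # Remove deleted predicates, then add the new ones.
        return (state - self.del_effects) | self.add_effects


def plan(init: State, goal: FrozenSet[str],
         library: List[AbstractAction]) -> Optional[List[str]]:
    """Breadth-first search for a shortest action sequence reaching the goal."""
    frontier = deque([(init, [])])
    visited = {init}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:
            return path
        for action in library:
            if action.applicable(state):
                nxt = action.apply(state)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, path + [action.name]))
    return None  # goal unreachable with this library


f = frozenset
library = [
    AbstractAction("open_drawer", f({"drawer_closed", "hand_empty"}),
                   f({"drawer_open"}), f({"drawer_closed"})),
    AbstractAction("pick_block", f({"block_on_table", "hand_empty"}),
                   f({"holding_block"}), f({"block_on_table", "hand_empty"})),
    AbstractAction("place_in_drawer", f({"holding_block", "drawer_open"}),
                   f({"block_in_drawer", "hand_empty"}), f({"holding_block"})),
    AbstractAction("close_drawer", f({"drawer_open", "hand_empty"}),
                   f({"drawer_closed"}), f({"drawer_open"})),
]

init = f({"drawer_closed", "hand_empty", "block_on_table"})
goal = f({"block_in_drawer", "drawer_closed"})
result = plan(init, goal, library)
print(result)
```

In this toy domain the planner has to open the drawer before placing the block because `place_in_drawer` lists `drawer_open` as a precondition. That is the kind of constraint an explicit precondition makes available to the planner, and replanning from a perturbed state amounts to calling `plan` again with a different `init`.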
