BBSEA: An Exploration of Brain-Body Synchronization for Embodied Agents
Sizhe Yang, Qian Luo, Anumpam Pani, Yanchao Yang
TL;DR
This paper addresses the challenge of autonomous learning for embodied agents without heavy human intervention by introducing BBSEA, a brain-body synchronization framework that couples large foundation models with physical agents. The approach grounds task proposals in a scene graph derived from robust sensing, uses LLMs to generate diverse, feasible tasks, and employs a GPT-based success inference mechanism to judge task outcomes, so that successful demonstrations can be distilled into a language-conditioned policy for continual skill acquisition. Through tabletop experiments, BBSEA demonstrates diverse, feasible task generation, accurate task-completion feedback, and improved policy distillation, with notable zero-shot and adaptation capabilities as task variety increases. The work presents a scalable pathway for autonomously training embodied agents to perform complex physical interactions across novel tasks and configurations, reducing reliance on human input and enabling broader generalization.
Abstract
Embodied agents capable of complex physical skills can improve productivity, elevate quality of life, and reshape human-machine collaboration. We aim to train embodied agents autonomously on various tasks, relying mainly on large foundation models. These models are believed to be capable of acting as a brain for embodied agents; however, existing methods rely heavily on humans for task proposal and scene customization, limiting the learning autonomy, training efficiency, and generalization of the learned policies. In contrast, we introduce a brain-body synchronization ({\it BBSEA}) scheme to promote embodied learning in unknown environments without human involvement. The proposed scheme combines the wisdom of foundation models (``brain'') with the physical capabilities of embodied agents (``body''). Specifically, it leverages the ``brain'' to propose learnable physical tasks and success metrics, enabling the ``body'' to automatically acquire various skills by continuously interacting with the scene. We explore the proposed autonomous learning scheme in a table-top setting and demonstrate that the proposed synchronization can generate diverse tasks and develop multi-task policies with promising adaptability to new tasks and configurations. We will release our data, code, and trained models to facilitate future studies on building autonomously learning agents with large foundation models in more complex scenarios. More visualizations are available at \href{https://bbsea-embodied-ai.github.io}{https://bbsea-embodied-ai.github.io}.
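To make the propose, execute, judge, and collect loop described above more concrete, here is a toy Python sketch in which every learned component is replaced by a hand-written stand-in. All names (`SceneGraph`, `Task`, `propose_tasks`, `execute`, `infer_success`, `Dataset`, `autonomous_round`) are hypothetical. The rule-based `propose_tasks` and `infer_success` only stand in for the LLM task proposer and the GPT-based success judge, and `execute` is a toy simulator in which anything placed on a sphere rolls back onto the table. It illustrates the data flow under these assumptions and is not the BBSEA implementation.

```python
from dataclasses import dataclass, field

# Hypothetical toy representation: a scene graph is a frozenset of
# (subject, relation, object) triples, e.g. ("red_block", "on", "table").
Triple = tuple[str, str, str]
SceneGraph = frozenset[Triple]


@dataclass
class Task:
    instruction: str          # language instruction for the policy
    goal: frozenset[Triple]   # relations that should hold after execution


def propose_tasks(graph: SceneGraph) -> list[Task]:
    """Rule-based STAND-IN for the LLM "brain": propose stacking any
    object currently on the table onto any other such object."""
    on_table = sorted(s for (s, r, o) in graph if r == "on" and o == "table")
    return [
        Task(f"put the {a} on the {b}", frozenset({(a, "on", b)}))
        for a in on_table
        for b in on_table
        if a != b
    ]


def execute(graph: SceneGraph, task: Task) -> SceneGraph:
    """Toy "body" simulator (assumption): move the object onto its target,
    except that anything placed on a sphere rolls back onto the table."""
    new = set(graph)
    for (a, _, b) in task.goal:
        # Drop the object's current support relation before re-placing it.
        new = {t for t in new if not (t[0] == a and t[1] == "on")}
        lands_on = "table" if (b, "shape", "sphere") in graph else b
        new.add((a, "on", lands_on))
    return frozenset(new)


def infer_success(final: SceneGraph, task: Task) -> bool:
    """Rule-based STAND-IN for the GPT success judge: every goal relation
    must appear in the post-execution scene graph."""
    return task.goal <= final


@dataclass
class Dataset:
    # (instruction, scene before, scene after) for each successful episode
    demos: list[tuple[str, SceneGraph, SceneGraph]] = field(default_factory=list)


def autonomous_round(graph: SceneGraph, data: Dataset) -> int:
    """One brain-body round: propose tasks, execute them, judge success,
    and keep only successful episodes as demonstrations for distillation."""
    kept = 0
    for task in propose_tasks(graph):
        final = execute(graph, task)
        if infer_success(final, task):
            data.demos.append((task.instruction, graph, final))
            kept += 1
    return kept


if __name__ == "__main__":
    scene = frozenset({
        ("red_block", "on", "table"), ("blue_block", "on", "table"),
        ("ball", "on", "table"), ("ball", "shape", "sphere"),
    })
    data = Dataset()
    print(autonomous_round(scene, data), "of", len(propose_tasks(scene)), "tasks kept")  # → 4 of 6 tasks kept
```

In this toy scene the two proposals that target the ball fail and are filtered out by the success check, which mirrors, in a very simplified way, how success inference decides which episodes become training data for the language-conditioned policy.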
