BBSEA: An Exploration of Brain-Body Synchronization for Embodied Agents

Sizhe Yang, Qian Luo, Anumpam Pani, Yanchao Yang

TL;DR

This paper addresses the challenge of autonomous learning for embodied agents without heavy human intervention by introducing BBSEA, a brain-body synchronization framework that couples Large Foundation Models with physical agents. The approach grounds task proposals in a scene graph derived from robust sensing, uses LLMs to generate diverse, feasible tasks, and employs a GPT-based success inference mechanism to enable continual skill acquisition via a language-conditioned policy learned through demonstrations. Through tabletop experiments, BBSEA demonstrates diverse, feasible task generation, accurate task completion feedback, and improved policy distillation, with notable zero-shot and adaptation capabilities as task variety increases. The work presents a scalable pathway for autonomously training embodied agents to perform complex physical interactions across novel tasks and configurations, reducing reliance on human input and enabling broader generalization.

Abstract

Embodied agents capable of complex physical skills can improve productivity, elevate life quality, and reshape human-machine collaboration. We aim at autonomous training of embodied agents for various tasks, relying mainly on large foundation models. It is believed that these models could act as a brain for embodied agents; however, existing methods rely heavily on humans for task proposal and scene customization, limiting the learning autonomy, training efficiency, and generalization of the learned policies. In contrast, we introduce a brain-body synchronization (BBSEA) scheme to promote embodied learning in unknown environments without human involvement. The proposed scheme combines the wisdom of foundation models ("brain") with the physical capabilities of embodied agents ("body"). Specifically, it leverages the "brain" to propose learnable physical tasks and success metrics, enabling the "body" to automatically acquire various skills by continuously interacting with the scene. We carry out an exploration of the proposed autonomous learning scheme in a table-top setting, and we demonstrate that the proposed synchronization can generate diverse tasks and develop multi-task policies with promising adaptability to new tasks and configurations. We will release our data, code, and trained models to facilitate future studies in building autonomously learning agents with large foundation models in more complex scenarios. More visualizations are available at https://bbsea-embodied-ai.github.io.
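
The loop described in the TL;DR and abstract (perceive the scene as a graph, let the brain propose tasks, let the body attempt them, and keep only the attempts the brain judges successful) can be sketched as follows. This is a minimal illustration and not the authors' implementation: every name here (`SceneGraph`, `Demonstration`, `collect_demonstrations`, and the four callables) is hypothetical, and the toy success check is a hand-written rule standing in for the GPT-based success inference the paper describes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SceneGraph:
    """Hypothetical scene graph: object names plus (subject, relation, object) triples."""
    objects: List[str]
    relations: List[Tuple[str, str, str]] = field(default_factory=list)


@dataclass
class Demonstration:
    """A task description paired with the steps that accomplished it."""
    task: str
    trajectory: List[str]


def collect_demonstrations(
    perceive: Callable[[], SceneGraph],
    propose_tasks: Callable[[SceneGraph], List[str]],
    attempt: Callable[[str], List[str]],
    infer_success: Callable[[str, SceneGraph, SceneGraph], bool],
) -> List[Demonstration]:
    """Run one round of the loop and return the attempts judged successful."""
    demos: List[Demonstration] = []
    for task in propose_tasks(perceive()):      # "brain" proposes tasks from the scene graph
        before = perceive()
        trajectory = attempt(task)              # "body" tries the task
        after = perceive()
        if infer_success(task, before, after):  # "brain" judges the outcome
            demos.append(Demonstration(task, trajectory))
    return demos


def toy_run() -> List[Demonstration]:
    """Toy stand-ins for all four components (assumptions for illustration only)."""
    world: Dict[str, str] = {"red block": "table", "bowl": "table"}

    def perceive() -> SceneGraph:
        return SceneGraph(
            objects=sorted(world),
            relations=[(obj, "at", place) for obj, place in sorted(world.items())],
        )

    def propose_tasks(graph: SceneGraph) -> List[str]:
        # A real proposer would query a language model with the graph.
        return ["put the red block in the bowl", "lift the table"]

    def attempt(task: str) -> List[str]:
        if task == "put the red block in the bowl":
            world["red block"] = "bowl"
            return ["pick red block", "place in bowl"]
        return []  # nothing the toy body can do for this task

    def infer_success(task: str, before: SceneGraph, after: SceneGraph) -> bool:
        # Rule-based stand-in: compare the scene graph after the attempt to the goal.
        if task == "put the red block in the bowl":
            return ("red block", "at", "bowl") in after.relations
        return False

    return collect_demonstrations(perceive, propose_tasks, attempt, infer_success)
```

In this sketch the filtered demonstrations are the only training signal; a language-conditioned policy would then be learned from them, which the toy code does not attempt.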

Paper Structure (36 sections, 1 equation, 16 figures, 7 tables)

Figures (16)

  • Figure 1: Left: An experienced elder instructs a toddler on playing with colorful blocks (image credit to GPT-4V). Right: Using the proposed brain-body synchronization scheme, foundation models (brain) teach an embodied agent (body) a variety of physical interaction skills. To accomplish this, the brain needs to propose interaction tasks that are compatible with the scene and the body's physical constraints, as well as define measurable success metrics for the suggested tasks. The body then synchronizes with the brain through trial and error, acquiring interaction skills solely based on feedback from the brain.
  • Figure 2: An overview of the proposed brain-body synchronization. The scene comprehension module constructs and passes a scene graph of the current environment to an LFM (brain). The brain then proposes interaction tasks compatible with the scene and the physical limitations of the body, which acquires the interaction skills via trial and error with solely the feedback from the brain.
  • Figure 3: Comparison between proposed tasks through GPT-4V (GPT4+Vision) and the task proposer in our pipeline. Ours can propose more context-relevant and feasible tasks for the agent to learn, leveraging easily digestible scene information in the graph.
  • Figure 4: An overview of the prompts used for the Task Proposer (left), Task Decomposer (middle), and the Success Inference (right) modules. These prompts ensure effective collection and completion inference of diverse and feasible tasks. Please note that all the prompts are fixed without tailoring to a specific task.
  • Figure 5: Multidimensional Scaling (MDS) is performed on the (human-endorsed) distance matrix to obtain a 2D plot of the tasks, which are further clustered by K-Means (left). From the clusters, a few tasks are chosen and highlighted (middle). MDS is performed again, but with the text embeddings of the tasks; the clusters from the left panel now overlap with each other, suggesting that text embeddings may not characterize the task space in a way compatible with human understanding.
  • ...and 11 more figures