Endowing GPT-4 with a Humanoid Body: Building the Bridge Between Off-the-Shelf VLMs and the Physical World

Yingzhao Jian, Zhongan Wang, Yi Yang, Hehe Fan

TL;DR

This work addresses the challenge of enabling humanoid agents to flexibly interact with open-world environments without costly task-specific data collection. It proposes BiBo, a framework that wires off-the-shelf Vision-Language Models to a diffusion-based motion executor through an embodied instruction compiler, translating high-level prompts into executable commands and refined joint trajectories conditioned on environmental feedback. The key contributions are the embodied instruction compiler with a three-stage VQA process, a diffusion-based motion executor that uses a Latent Diffusion Model and VAE to maintain continuity and adapt to dynamics, and extensive experiments showing strong single-interaction success and competitive long-horizon performance. BiBo demonstrates the potential of leveraging general-purpose VLMs for real-time humanoid control, offering a scalable path toward robust, data-efficient embodied AI systems, with future work aimed at explicit scene geometry modeling and broader interaction modalities.

Abstract

Humanoid agents often struggle to handle flexible and diverse interactions in open environments. A common solution is to collect massive datasets to train a highly capable model, but this approach can be prohibitively expensive. In this paper, we explore an alternative solution: empowering off-the-shelf Vision-Language Models (VLMs, such as GPT-4) to control humanoid agents, thereby leveraging their strong open-world generalization to mitigate the need for extensive data collection. To this end, we present BiBo (Building humanoId agent By Off-the-shelf VLMs). It consists of two key components: (1) an embodied instruction compiler, which enables the VLM to perceive the environment and precisely translate high-level user instructions (e.g., "have a rest") into low-level primitive commands with control parameters (e.g., "sit casually, location: (1, 2), facing: 90°"); and (2) a diffusion-based motion executor, which generates human-like motions from these commands, while dynamically adapting to physical feedback from the environment. In this way, BiBo is capable of handling not only basic interactions but also diverse and complex motions. Experiments demonstrate that BiBo achieves an interaction task success rate of 90.2% in open environments, and improves the precision of text-guided motion execution by 16.3% over prior methods. The code will be made publicly available.
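
The primitive command quoted in the abstract can be pictured as a small structured record. The following is a minimal sketch only: the class `PrimitiveCommand`, its field names (`caption`, `location`, `facing_deg`), and the helper `parse_command` are hypothetical names introduced for illustration, and the textual layout is assumed from the single example in the abstract rather than taken from BiBo's actual schema.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PrimitiveCommand:
    """Hypothetical container for one low-level command (not BiBo's real schema)."""
    caption: str                   # motion caption, e.g. "sit casually"
    location: Tuple[float, float]  # assumed 2D target position
    facing_deg: float              # assumed heading in degrees, wrapped to [0, 360)


# Assumed layout: "<caption>, location: (x, y), facing: <degrees>"
_NUM = r"-?\d+(?:\.\d+)?"
_PATTERN = re.compile(
    r"^\s*(?P<caption>[^,]+?)\s*,\s*"
    rf"location:\s*\(\s*(?P<x>{_NUM})\s*,\s*(?P<y>{_NUM})\s*\)\s*,\s*"
    rf"facing:\s*(?P<deg>{_NUM})\s*\u00b0?\s*$"  # optional degree sign
)


def parse_command(text: str) -> PrimitiveCommand:
    """Parse a command string in the assumed layout; raise ValueError otherwise."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized command: {text!r}")
    return PrimitiveCommand(
        caption=match["caption"],
        location=(float(match["x"]), float(match["y"])),
        facing_deg=float(match["deg"]) % 360.0,
    )


cmd = parse_command("sit casually, location: (1, 2), facing: 90")
```

Rejecting strings that do not match the layout, instead of guessing at missing fields, is one way to keep free-form VLM output from reaching the executor unchecked; whether BiBo validates commands this way is not stated in the abstract.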

Paper Structure

This paper contains 72 sections, 9 equations, 18 figures, 17 tables, 1 algorithm.

Figures (18)

  • Figure 1: BiBo is a humanoid agent powered by an off-the-shelf VLM. It consists of an embodied instruction compiler (Inst. Compiler) and a diffusion-based motion executor. When the user provides a high-level instruction, the compiler observes the environment and translates it into the structured command for the executor. The executor then generates future motions for the humanoid agent, conditioned on both the command and the physical feedback from the environment. In this way, BiBo is able to perform diverse types of physical scene interactions.
  • Figure 2: The embodied instruction compiler takes in user instructions and environmental observations, and directs the VLM to generate the next motion command through a structured three-stage visual question–answering process. In the first stage, it analyzes the basic attributes of the motion (e.g., caption, key joints, target object). In the second stage, it reasons about the agent’s pose during the interaction. Finally, it specifies the target positions for the key joints.
  • Figure 3: The motion executor is a Latent Diffusion Model. On receiving a command (motion caption and control parameters) from the compiler, the diffusion model generates the future latents ${\bm{S}}_f$ as a continuation of the actually executed motion tokens ${\bm{S}}_a$, conditioned on the command tokens ${\bm{s}}_m$ and ${\bm{s}}_c$. The previous and newly generated latents are then jointly decoded by the VAE decoder. The decoder uses causal attention, where each motion frame or latent token can attend only to its preceding tokens or frames. After IK optimization, a tracking policy drives the humanoid joints to execute the newly generated motion ${\bm{M}}_f$ in the physical environment, producing the next execution result.
  • Figure 4: Summary of the randomly generated scene dataset. The tasks are constructed with a semi-automatic approach. The dataset covers various object categories, task types, and difficulty levels, evaluating a wide range of interaction abilities of humanoid agents.
  • Figure 5: Visualization of the execution results of the compared methods. Compared with BiBo, UniHSI generates less natural motions, while HumanVLA requires stricter initial positioning for transportation. MoConVQ shows limited motion activity, and CLoSD struggles to achieve precise control.
  • ...and 13 more figures
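
The causal attention mentioned in the Figure 3 caption can be illustrated with a toy mask. This is a minimal sketch in plain Python, not the paper's decoder: `causal_mask` and `masked_average` are hypothetical helpers, uniform averaging stands in for learned attention weights, and letting a position attend to itself as well as to earlier positions is an assumption the caption does not settle.

```python
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Assumption: a position sees itself and all earlier positions.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_average(values: List[float]) -> List[float]:
    """Toy stand-in for attention: uniform weights over the allowed positions."""
    mask = causal_mask(len(values))
    out = []
    for row in mask:
        allowed = [v for v, ok in zip(values, row) if ok]
        out.append(sum(allowed) / len(allowed))
    return out


mask = causal_mask(4)
averages = masked_average([1.0, 3.0, 5.0, 7.0])
# Changing only the last input leaves every earlier output untouched.
edited = masked_average([1.0, 3.0, 5.0, 100.0])
```

In this toy setting, the outputs for earlier positions do not depend on later inputs, which is the property a causal mask is meant to provide when new tokens are appended to an existing sequence.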