Endowing GPT-4 with a Humanoid Body: Building the Bridge Between Off-the-Shelf VLMs and the Physical World
Yingzhao Jian, Zhongan Wang, Yi Yang, Hehe Fan
TL;DR
This work addresses the challenge of enabling humanoid agents to interact flexibly with open-world environments without costly task-specific data collection. It proposes BiBo, a framework that connects off-the-shelf Vision-Language Models (VLMs) to a diffusion-based motion executor through an embodied instruction compiler. The compiler translates high-level prompts into executable commands, and the executor turns these commands into joint trajectories refined by environmental feedback. The key contributions are (1) the embodied instruction compiler with a three-stage visual question answering (VQA) process, (2) a diffusion-based motion executor that uses a Latent Diffusion Model and a variational autoencoder (VAE) to maintain motion continuity and adapt to dynamics, and (3) extensive experiments showing strong single-interaction success and competitive long-horizon performance. BiBo demonstrates the potential of general-purpose VLMs for real-time humanoid control and offers a scalable path toward robust, data-efficient embodied AI systems. Future work targets explicit scene geometry modeling and broader interaction modalities.
Abstract
Humanoid agents often struggle to handle flexible and diverse interactions in open environments. A common solution is to collect massive datasets to train a highly capable model, but this approach can be prohibitively expensive. In this paper, we explore an alternative: empowering off-the-shelf Vision-Language Models (VLMs, such as GPT-4) to control humanoid agents, thereby leveraging their strong open-world generalization to reduce the need for extensive data collection. To this end, we present \textbf{BiBo} (\textbf{B}uilding humano\textbf{I}d agent \textbf{B}y \textbf{O}ff-the-shelf VLMs). It consists of two key components: (1) an \textbf{embodied instruction compiler}, which enables the VLM to perceive the environment and precisely translate high-level user instructions (e.g., {\small\itshape ``have a rest''}) into low-level primitive commands with control parameters (e.g., {\small\itshape ``sit casually, location: (1, 2), facing: 90$^\circ$''}); and (2) a diffusion-based \textbf{motion executor}, which generates human-like motions from these commands while dynamically adapting to physical feedback from the environment. In this way, BiBo can handle not only basic interactions but also diverse and complex motions. Experiments demonstrate that BiBo achieves an interaction task success rate of 90.2\% in open environments and improves the precision of text-guided motion execution by 16.3\% over prior methods. The code will be made publicly available.
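
To make the compiler's output format concrete, the sketch below parses a primitive command written like the abstract's example (``sit casually, location: (1, 2), facing: 90°'') into a structured record that a downstream executor could consume. It is only an illustration under stated assumptions: the names `PrimitiveCommand` and `parse_command`, the exact string grammar, and the normalization of the facing angle to [0, 360) are hypothetical and are not taken from the BiBo code, which may represent commands differently.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PrimitiveCommand:
    """Hypothetical container for one low-level command (not BiBo's actual type)."""
    action: str                    # free-text motion description, e.g. "sit casually"
    location: Tuple[float, float]  # target position on the ground plane
    facing_deg: float              # target heading in degrees, normalized to [0, 360)


# Assumed grammar: "<action>, location: (<x>, <y>), facing: <degrees>[°]"
_NUMBER = r"-?\d+(?:\.\d+)?"
_PATTERN = re.compile(
    r"^\s*(?P<action>[^,]+?)\s*,\s*"
    r"location:\s*\(\s*(?P<x>" + _NUMBER + r")\s*,\s*(?P<y>" + _NUMBER + r")\s*\)\s*,\s*"
    r"facing:\s*(?P<deg>" + _NUMBER + r")\s*°?\s*$"
)


def parse_command(text: str) -> PrimitiveCommand:
    """Parse a command string in the assumed grammar, or raise ValueError."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized command: {text!r}")
    return PrimitiveCommand(
        action=match.group("action"),
        location=(float(match.group("x")), float(match.group("y"))),
        # Wrap the heading so that, e.g., -90 and 270 denote the same direction.
        facing_deg=float(match.group("deg")) % 360.0,
    )


if __name__ == "__main__":
    print(parse_command("sit casually, location: (1, 2), facing: 90°"))
```

Rejecting malformed strings with an exception, rather than guessing at missing fields, is one plausible design choice when the command text comes from a VLM: the caller can re-prompt the model instead of executing a partially specified motion.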
