Body of Her: A Preliminary Study on End-to-End Humanoid Agent

Tenglong Ao

TL;DR

This work introduces a unified, end-to-end, duplex humanoid agent that jointly models speech, full-body movement, and manipulation by extending a pre-trained LLM with audio and visual modalities. It uses DAC for discrete audio tokens and a continuous video token representation with a diffusion head; generation is conditioned on text prompts and motion trajectories, and the model is trained with two-stage fine-tuning plus RLHF. The system runs in real time (~42 ms per frame at 24 fps) and demonstrates capabilities such as object manipulation and responsive dialogue, while acknowledging limitations in physics fidelity, identity maintenance, and scene complexity. Overall, the paper lays groundwork for scalable end-to-end humanoid agents and provides a data and training framework for multimodal, interactive world simulation.
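As a quick sanity check on the real-time figure, the per-frame time budget at a given frame rate is simply 1000 ms divided by the frame rate. The helper below is a hypothetical illustration (the name `frame_budget_ms` is not from the paper); it only shows that 24 fps leaves roughly 42 ms for generating each frame.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the time available per frame, in milliseconds, at the given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


if __name__ == "__main__":
    # At 24 fps, each frame must be produced in about 41.7 ms to keep up.
    print(f"{frame_budget_ms(24):.1f} ms per frame")
```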

Abstract

An interactive virtual humanoid agent is a crucial interface with the physical world. A relatively complete humanoid agent first needs a face and body, then both verbal and non-verbal (such as eye contact, facial expression, lip motion, gesture, and manipulation) abilities, and finally the capacity for real-time duplex communication, e.g., the ability to actively interrupt conversations. Most prior systems consider only a subset of these elements, leaving a gap to a realistic humanoid agent. In this work, we propose a real-time, duplex, interactive end-to-end network capable of modeling realistic agent behaviors, including speech, full-body movements for talking, responding, idling, and manipulation. This system is a multimodal model integrating audio and visual inputs, extended from a pre-trained large language model (LLM). We collect approximately 200,000 hours of audio, around 130,000 hours of video data, and about 20,000 alignment samples to build the model. The final model demonstrates capabilities that were difficult to achieve in previous systems, such as generalized object manipulation. This work is a preliminary exploration of the end-to-end approach in this field, aiming to inspire further research towards scaling up.

Paper Structure

This paper contains 17 sections, 7 equations, and 3 figures.

Figures (3)

  • Figure 1: Our system continuously synthesizes the agent's voice and visual appearance based on multi-source streaming inputs, including the interlocutor's auditory and visual behaviors and specific control signals. The visual representation can be in video or 3D motion form, depending on rendering and computational power. Control signals use text descriptions for high-level behaviors like emotions and motion trajectories for low-level joint movement guidance.
  • Figure 2: A Transformer decoder $\mathcal{G}$ models the probability distribution of agent behaviors at the $i$-th frame, conditioned on previous behaviors of the agent and the human interlocutor, along with specific control signals, as follows: $p_{\mathcal{G}}(\bm{A}^{\text{a}}_{i}, \bm{V}^{\text{a}}_{i} \mid \bm{A}^{\text{a}}_{<i}, \bm{V}^{\text{a}}_{<i}, \bm{C}_{i})$, where $\bm{C}_{i} = [\bm{A}^{\text{h}}_{<i}, \bm{V}^{\text{h}}_{<i}, \bm{P}^{\text{a}}_{i}, \bm{r}^{\text{a}}_{i}, \bm{s}^{\text{a}}_{i}]$. $\bm{s}^{\text{a}}_{i}$ is a learnable embedding that specifies the agent's identity. The predicted agent audio $\bm{A}^{\text{a}*}_{i}$ and video $\bm{V}^{\text{a}*}_{i}$ are then sampled from this distribution.
  • Figure 3: Chain-of-thought (CoT) process with the vision-language model (VLM): given the historical conversation context, i.e., image key-frames and a transcript (transcribed from audio), (a) the description stage linguistically portrays the characteristics (e.g., gender, age, and appearance), emotion, behavior, and environment (e.g., space, time, critical objects) of the human and the agent, respectively. (b) The analysis stage focuses on details of emotions, behaviors, and critical objects and their influence on the agent. (c) The planning stage makes final predictions of the future behavior, emotion, and trajectory of the agent. The predicted text prompt and trajectory are used as control signals of $\mathcal{G}$ to affect the agent's future behaviors.
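
The frame-by-frame conditioning described in the Figure 2 caption can be sketched as a causal streaming loop. This is a minimal toy illustration, not the paper's implementation: `toy_decoder` is a hypothetical stand-in for $\mathcal{G}$ that draws Gaussian samples, frames are reduced to placeholder `(audio, video)` float pairs, and reading $\bm{P}$ as the text prompt and $\bm{r}$ as the motion trajectory is an assumption based on the control signals named in the Figure 1 and Figure 3 captions. The point it shows is the ordering: the agent's frame $i$ is sampled from histories that contain only frames before $i$, and both histories are extended afterwards.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Frame = Tuple[float, float]  # placeholder (audio, video) features for one frame


@dataclass
class Control:
    """Per-frame control signals (field meanings are assumptions, see lead-in)."""
    prompt: str        # stands in for P_i: high-level text prompt
    trajectory: float  # stands in for r_i: low-level motion hint (toy scalar)
    identity: int      # stands in for s_i: index of an identity embedding


def toy_decoder(agent_hist: List[Frame], human_hist: List[Frame],
                control: Control, rng: random.Random) -> Frame:
    """Hypothetical stand-in for G: sample one agent frame given past context."""
    last_agent = agent_hist[-1] if agent_hist else (0.0, 0.0)
    last_human = human_hist[-1] if human_hist else (0.0, 0.0)
    # Toy dynamics: audio reacts to the interlocutor, video follows the trajectory.
    audio = 0.5 * last_agent[0] + 0.5 * last_human[0] + rng.gauss(0.0, 0.1)
    video = 0.5 * last_agent[1] + control.trajectory + rng.gauss(0.0, 0.1)
    return (audio, video)


def rollout(human_stream: List[Frame], controls: List[Control],
            seed: int = 0) -> List[Frame]:
    """Generate agent frames one at a time, conditioning only on frames < i."""
    rng = random.Random(seed)
    agent_hist: List[Frame] = []
    human_hist: List[Frame] = []
    for human_frame, control in zip(human_stream, controls):
        # Sample frame i before the human's frame i is added to the context.
        agent_hist.append(toy_decoder(agent_hist, human_hist, control, rng))
        human_hist.append(human_frame)
    return agent_hist


if __name__ == "__main__":
    human = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
    controls = [Control(prompt="smile", trajectory=0.1, identity=0)] * 3
    for i, frame in enumerate(rollout(human, controls)):
        print(i, frame)
```

Because the human's current frame is appended only after sampling, changing the interlocutor's frame $i$ can influence the agent from frame $i+1$ onward but never frame $i$ itself, which is what the $<i$ subscripts in the caption express.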