Trinity: A Modular Humanoid Robot AI System

Jingkai Sun, Qiang Zhang, Gang Han, Wen Zhao, Zhe Yong, Yan He, Jiaxu Wang, Jiahang Cao, Yijie Guo, Renjing Xu

TL;DR

Trinity tackles the challenge of versatile humanoid robotics by unifying RL-based locomotion, visual-language perception, and language-driven task planning in a modular, hierarchical architecture. The locomotion module uses Adversarial Motion Priors with an AMP discriminator and a periodic/regularization reward structure within an MDP $(\mathcal{S},\mathcal{A},\mathcal{R},p,\gamma)$ and a policy $\pi(a_t|s_t)$ to achieve human-like motion; a Finite State Machine manages gait transitions. The perception module employs ManipVQA to fuse RGB-D visual data with semantic queries, producing actionable representations for the LLM planner, which in turn composes a sequence of robot skills from Arm, Hand, and Body capabilities with kinematic-aware prompting. Real-world experiments on a full-scale humanoid and safety-focused evaluations demonstrate robust loco-manipulation under dynamic upper-body movements, and safety constraints are enforced through the LLM-driven planning layer, enabling safer operation in unstructured environments. Overall, Trinity demonstrates the feasibility and benefits of an integrated, modular humanoid AI stack that leverages multimodal perception, long-horizon reasoning, and robust motion control to operate effectively in complex real-world settings.
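As a reading aid for the notation above: this summary does not state the exact training objective, but assuming the standard discounted-return formulation of RL, the policy $\pi$ would be trained to maximize

$$
J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1} \gamma^{t}\, r_t\right], \qquad r_t = \mathcal{R}(s_t, a_t), \quad a_t \sim \pi(a_t|s_t), \quad s_{t+1} \sim p(s_{t+1}|s_t, a_t),
$$

where the horizon $T$ and the argument convention for $\mathcal{R}$ are assumptions made here for illustration, and $r_t$ would combine the imitation reward from the AMP discriminator with the periodic and regularization terms.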

Abstract

Research on humanoid robots has attracted growing attention in recent years. With breakthroughs across many kinds of artificial intelligence algorithms, embodied intelligence, exemplified by humanoid robots, is highly anticipated. Advances in reinforcement learning (RL) algorithms have significantly improved the motion control and generalization capabilities of humanoid robots. At the same time, progress in large language models (LLMs) and visual language models (VLMs) has opened up new possibilities for humanoid robots. LLMs enable humanoid robots to understand complex tasks from language instructions and to perform long-horizon task planning, while VLMs greatly enhance the robots' understanding of, and interaction with, their environment. This paper introduces Trinity, a novel AI system for humanoid robots that integrates RL, LLMs, and VLMs. By combining these technologies, Trinity enables efficient control of humanoid robots in complex environments. This approach not only enhances the robots' capabilities but also opens new avenues for future research on, and applications of, humanoid robotics.

Paper Structure

This paper contains 13 sections, 9 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the Modular Humanoid Robot AI System. In this system, task instructions are processed by both a vision-language perception module and a large language model (LLM). The perception module, using input from an RGB-D camera, identifies the bounding box of the movable parts of an object. The system then utilizes the depth image to calculate the 3D position of the movable part, which is subsequently fed into the LLM-based task planner. To ensure optimal performance and safety, the task planner also integrates additional inputs: the task description, a skill library, workspace limitations, safety constraints, and prior kinematic knowledge. Once the task planner generates action commands, the humanoid robot’s controllers execute the command sequences to complete the task.
  • Figure 2: Overview of locomotion policy training. State transitions sampled from demonstrations and those generated by the policy are fed into a discriminator to calculate the imitation reward. The policy receives proprioception, the command, and a periodic signal as input and outputs an action.
  • Figure 3: Process of a humanoid robot opening a door. The humanoid robot begins by grasping the door handle, ensuring both feet are firmly planted on the floor. As the robot pulls the door, it encounters external forces and responds by lifting its right foot to maintain balance. Subsequently, the robot takes a step back. Finally, it lifts its left foot and steps back once more to achieve a stable stance.
  • Figure 4: Our policy enables the humanoid robot to maintain stability while standing, even during rapid upper body movements and height changes. This demonstrates the robot's ability to handle fast, dynamic arm motion sequences without losing balance.
  • Figure 5: The humanoid changes its height as commanded while carrying a load. The height, pitch, and roll curves are shown on the left. The robot can change its pose to adapt to different heights.
  • ...and 1 more figures
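
The Figure 1 caption describes turning a 2D bounding box plus a depth image into a 3D position for the task planner. The summary does not say how Trinity computes this, so the following is only a minimal sketch of one common way to do it: pinhole-camera back-projection of the box centre. The function name `bbox_to_3d`, the intrinsics `fx, fy, cx, cy`, and the use of the median depth inside the box are all assumptions made for illustration.

```python
import numpy as np


def bbox_to_3d(depth_m, bbox, fx, fy, cx, cy):
    """Back-project the centre of a bounding box into the camera frame.

    Hypothetical helper, not taken from the paper.
    depth_m: (H, W) depth image in metres; 0 or NaN marks invalid pixels.
    bbox:    (u_min, v_min, u_max, v_max) in pixel coordinates.
    fx, fy, cx, cy: pinhole intrinsics of the depth-aligned camera.
    Returns a length-3 array [x, y, z] in metres, or None if no valid depth.
    """
    u_min, v_min, u_max, v_max = bbox

    # Collect valid depth readings inside the box.
    patch = depth_m[v_min:v_max, u_min:u_max]
    valid = patch[np.isfinite(patch) & (patch > 0)]
    if valid.size == 0:
        return None

    # The median is robust to holes and background pixels in the box.
    z = float(np.median(valid))

    # Pixel coordinates of the box centre.
    u = 0.5 * (u_min + u_max)
    v = 0.5 * (v_min + v_max)

    # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])


if __name__ == "__main__":
    depth = np.full((480, 640), 2.0)  # synthetic flat scene 2 m away
    point = bbox_to_3d(depth, (410, 230, 430, 250),
                       fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    print(point)
```

The resulting point is expressed in the camera frame; a real system would still need to transform it into the robot's base frame before passing it to the planner, a step this sketch leaves out.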