Prompt, Plan, Perform: LLM-based Humanoid Control via Quantized Imitation Learning

Jingkai Sun, Qiang Zhang, Yiqun Duan, Xiaoyang Jiang, Chong Cheng, Renjing Xu

TL;DR

The paper addresses the challenge of enabling humanoid robots to perform unseen tasks by uniting Generative Adversarial Imitation Learning with large language model planning. It introduces a single-policy framework augmented by a CLIP-based encoder and codebook-based vector quantization, guided by an LLM planner that sequences reusable skills under a general directional reward. The approach reduces manual reward engineering and high-level policy design, achieving zero-shot task execution in obstacle-rich scenarios. Experiments in simulation demonstrate efficient adaptation and robustness to stochastic LLM outputs; the authors note that the idealized data and simulation environment are limitations that warrant future real-world validation.

Abstract

In recent years, reinforcement learning and imitation learning have shown great potential for controlling humanoid robots' motion. However, these methods typically build simulation environments and rewards for specific tasks, which requires multiple policies and limits their ability to tackle complex and unknown tasks. To overcome these issues, we present a novel approach that combines adversarial imitation learning with large language models (LLMs). This method enables the agent to learn reusable skills with a single policy and solve zero-shot tasks under the guidance of LLMs. In particular, we use the LLM as a strategic planner that applies previously learned skills to novel tasks by interpreting task-specific prompts, so that the robot performs the specified actions in sequence. To improve our model, we incorporate codebook-based vector quantization, allowing the agent to generate suitable actions in response to unseen textual commands from LLMs. Furthermore, we design general reward functions that consider the distinct motion features of humanoid robots, ensuring the agent imitates the motion data while maintaining goal orientation without additional direction-guiding approaches or policies. To the best of our knowledge, this is the first framework that controls humanoid robots using a single learned policy network with an LLM as a planner. Extensive experiments demonstrate that our method is efficient and adaptive in complicated motion tasks.
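The codebook-based vector quantization mentioned in the abstract can be illustrated with a nearest-neighbour lookup. This is only a minimal sketch of the general technique, not the authors' implementation: the function name `quantize`, the two-dimensional toy codebook, and the use of squared Euclidean distance are all assumptions made for illustration. The idea is that a text feature for an unseen command is snapped to the closest learned code, so the policy only ever receives inputs it was trained on.

```python
import numpy as np


def quantize(text_feature: np.ndarray, codebook: np.ndarray) -> tuple[int, np.ndarray]:
    """Return the index and vector of the codebook entry nearest to text_feature."""
    # Squared Euclidean distance from the feature to every codebook row.
    distances = np.sum((codebook - text_feature) ** 2, axis=1)
    index = int(np.argmin(distances))
    return index, codebook[index]


# Hypothetical toy codebook: three skill embeddings in 2-D (not from the paper).
codebook = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

# A feature for an unseen command that lies close to the first code.
unseen_feature = np.array([0.9, 0.2])
index, code = quantize(unseen_feature, codebook)
```

In this toy setup the unseen feature maps to the first codebook entry, which stands in for the skill whose caption embedding it most resembles.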

Paper Structure (15 sections, 7 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Our framework enables robots to combine the skills obtained from imitation learning with the planning capabilities of LLMs to accomplish complex tasks. For example, with known obstacles as well as its own coordinates, the robot accomplishes the task of hitting a target after avoiding obstacles by scheduling reusable skills.
  • Figure 2: Overview of our proposed system. Motion captions with the same semantics are first clustered together by fine-tuning the CLIP text encoder. Subsequently, the output text features are fed into policy training through codebook-based vector quantization. Our pre-training system feeds a reference dataset defining the desired underlying motion and its corresponding text labels (marked in red in the figure) into the discriminator being trained, which provides discriminator rewards for policy training. The discriminator reward is then combined with the task reward for controlling orientation to train a policy that allows the robot to execute the demonstrated motion in the specified orientation. These two processes are trained separately, not at the same time.
  • Figure 3: Evaluation system overview. The human definitions are fed into the LLMs as prompts. The LLMs output the sequence of actions and the target orientation of each action. The text of the action sequence is input to the CLIP text encoder, and the target orientation is input to the policy as an observation, concatenated with the observation given by the environment.
  • Figure 4: Initialization of an obstacle avoidance attack task. The gray rectangle represents the attack target, the blue markers are the intermediate waypoints of the LLM's plan, red marks the obstacle, and the green line points to the current target orientation of the robot.
  • Figure 5: The movements and their target orientations for each step in completing the obstacle avoidance attack task. The example shows the initial position of the robot (0,0), the obstacle position (3,0), and the target position (6,0). The obstacle is a rectangle measuring 1.2 m along the x-axis and 1.8 m along the y-axis. The blue markers are the waypoints of the LLM's plan, red marks the obstacle, and the green line points to the current target orientation of the robot.
  • ...and 1 more figure
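The geometry in the Figure 5 caption can be made concrete with a small sketch. In the paper the LLM itself outputs the target orientation for each action; the code below only shows how such an orientation relates to a waypoint path, under stated assumptions. The waypoint (3.0, 1.5) is hypothetical, as are the helper names `inside_obstacle` and `target_orientations`, and the obstacle position (3,0) is assumed to be the rectangle's center. The start, obstacle, and target coordinates and the obstacle size come from the caption.

```python
import math

# Values taken from the Figure 5 caption.
START = (0.0, 0.0)
TARGET = (6.0, 0.0)
OBSTACLE_CENTER = (3.0, 0.0)  # assumed to be the center of the rectangle
OBSTACLE_SIZE = (1.2, 1.8)    # extent along x and y, in metres


def inside_obstacle(point):
    """True if the point lies inside the axis-aligned obstacle rectangle."""
    dx = abs(point[0] - OBSTACLE_CENTER[0])
    dy = abs(point[1] - OBSTACLE_CENTER[1])
    return dx <= OBSTACLE_SIZE[0] / 2 and dy <= OBSTACLE_SIZE[1] / 2


def target_orientations(start, waypoints):
    """Heading in radians (measured from the +x axis) for each leg of the path."""
    headings = []
    current = start
    for nxt in waypoints:
        headings.append(math.atan2(nxt[1] - current[1], nxt[0] - current[0]))
        current = nxt
    return headings


# Hypothetical plan: pass above the obstacle, then head to the target.
waypoints = [(3.0, 1.5), TARGET]
headings = target_orientations(START, waypoints)
```

With this hypothetical waypoint, the first leg heads up and to the right and the second leg mirrors it back down to the target, and the waypoint lies outside the obstacle because 1.5 m exceeds the rectangle's half-height of 0.9 m.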