AnyBipe: An End-to-End Framework for Training and Deploying Bipedal Robots Guided by Large Language Models

Yifei Yao; Wentao He; Chenyu Gu; Jiaheng Du; Fuwei Tan; Zhen Zhu; Junguo Lu

TL;DR

The paper tackles the challenge of designing and deploying RL policies for bipedal robots, which is hindered by the complexity of reward design and a persistent sim-to-real gap. It proposes AnyBipe, an end-to-end framework guided by large language models that comprises three interconnected modules: (1) LLM-based reward design with structured prompt engineering, (2) RL training that uses a reference policy to address cold-start convergence, and (3) a homomorphic evaluation loop that bridges simulation and real-world deployment. Key contributions include minimizing human intervention, integrating prior control knowledge, and using real-world feedback to refine rewards via a homomorphic mapping, which accelerates convergence and improves deployment safety. Experiments across three robots and multiple LLMs show improved training stability, closer sim-to-real alignment, and successful real-world deployment, suggesting the approach could extend to broader robotic tasks and settings.

Abstract

Training and deploying reinforcement learning (RL) policies for robots, especially in accomplishing specific tasks, presents substantial challenges. Recent advancements have explored diverse reward function designs, training techniques, simulation-to-reality (sim-to-real) transfers, and performance analysis methodologies, yet these still require significant human intervention. This paper introduces an end-to-end framework for training and deploying RL policies, guided by Large Language Models (LLMs), and evaluates its effectiveness on bipedal robots. The framework consists of three interconnected modules: an LLM-guided reward function design module, an RL training module leveraging prior work, and a sim-to-real homomorphic evaluation module. This design significantly reduces the need for human input by utilizing only essential simulation and deployment platforms, with the option to incorporate human-engineered strategies and historical data. We detail the construction of these modules, their advantages over traditional approaches, and demonstrate the framework's capability to autonomously develop and refine controlling strategies for bipedal robot locomotion, showcasing its potential to operate independently of human intervention.

Paper Structure (11 sections, 4 equations, 11 figures, 6 tables, 1 algorithm)

Introduction
Related Works
Methods
Module 1: LLM Guided Reward Function Design with Proper Prompt Engineering
Module 2: RL Training Adopting Reference Policy for Cold-Start
Module 3: Feedback From Simulation and Deployment Stage With Minimal Human Effort
Experiments
Experimental Setup
Module Analysis
Framework Analysis
Conclusion

Figures (11)

Figure 1: Our framework is organized into three interconnected modules. After receiving all prerequisites and requirements, the framework generates a reward function via an LLM, trains a policy in simulation, and evaluates it in both simulation and reality, providing feedback. The whole procedure requires minimal human labor.
Figure 2: Demonstration of LLM reward generation iterations. LLMs are encouraged to produce CoT-style output and to adjust their generated results according to feedback. Each iteration generates $N = K$ reward function samples, which are trained and evaluated separately; the best-performing one is used as the template for the next LLM generation.
Figure 3: Homomorphic reward function conversion procedure.
Figure 4: Robot and DOF definitions.
Figure 6: Reward, success, and terrain level for teacher-guided and original RL training with human-engineered rewards, using P1 as an example. The blue line shows the trend of training with a teacher, and the orange line is pure PPO.
...and 6 more figures
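
The iteration scheme in the Figure 2 caption (generate K reward candidates, train and evaluate each one separately, reuse the best as the next template) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `refine_reward`, `toy_generate`, and `toy_score` are hypothetical names, the "reward function" is reduced to a single number encoded as a string, and scoring by distance to a target stands in for RL training plus evaluation. Tracking an overall best across rounds is also an assumption added here.

```python
import random
from typing import Callable, Tuple


def refine_reward(
    generate: Callable[[str, str], str],      # hypothetical LLM call: (template, feedback) -> candidate
    train_and_score: Callable[[str], float],  # hypothetical train + evaluate step: candidate -> score
    template: str,
    iterations: int,
    k: int,
) -> Tuple[str, float]:
    """Return the best candidate seen and its score."""
    best_code, best_score = template, float("-inf")
    feedback = ""
    for _ in range(iterations):
        # Generate K candidates from the current template and feedback.
        candidates = [generate(template, feedback) for _ in range(k)]
        # Each candidate is trained and evaluated separately.
        scored = [(train_and_score(c), c) for c in candidates]
        round_score, round_code = max(scored, key=lambda pair: pair[0])
        # The best candidate of this round becomes the next template.
        template = round_code
        feedback = f"best score this round: {round_score:.3f}"
        # Assumption: also remember the best candidate across all rounds.
        if round_score > best_score:
            best_code, best_score = round_code, round_score
    return best_code, best_score


_rng = random.Random(0)  # seeded so the toy run is repeatable


def toy_generate(template: str, feedback: str) -> str:
    # Stand-in for the LLM: perturb a single numeric "reward weight".
    return repr(float(template) + _rng.uniform(-1.0, 1.0))


def toy_score(code: str) -> float:
    # Stand-in for training and evaluation: closer to 3.0 is better.
    return -abs(float(code) - 3.0)


if __name__ == "__main__":
    code, score = refine_reward(toy_generate, toy_score, "0.0", iterations=5, k=4)
    print(code, score)
```

In a real pipeline, `generate` would wrap the prompt and LLM call and `train_and_score` would wrap simulation training and evaluation; both are left abstract here because the source does not specify their interfaces.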
