RoboCoder: Robotic Learning from Basic Skills to General Tasks with Large Language Models

Jingyao Li, Pengguang Chen, Sitong Wu, Chuanyang Zheng, Hong Xu, Jiaya Jia

TL;DR

This work tackles the limited generalization of robotic learning under single-task benchmarks by introducing RoboCoder, an autonomous framework that builds from basic skills to complex tasks through an adaptive action space. A new 80-task benchmark spanning 7 entities in IsaacGym is proposed to stress-test learning from minimal mastery; on it, GPT-4 achieves a $47\%$ pass rate in three-shot humanoid scenarios. RoboCoder integrates a Searcher, an Actor, and an Evaluator to iteratively expand and refine executable action codes using real-time environmental feedback, achieving a $36\%$ relative improvement over GPT-4 for humanoids and up to $92\%$ in quadruped environments, while significantly improving inference speed via a lightweight action-searcher. The results support the framework's robustness across diverse models and entities and suggest benefits for open-world robotic manipulation and future real-world deployment, although the evaluation is currently limited to simulated environments.

Abstract

The emergence of Large Language Models (LLMs) has improved the prospects for robotic tasks. However, existing benchmarks are still restricted to single tasks and offer little insight into generalization. In this work, we introduce a comprehensive benchmark and an autonomous learning framework, RoboCoder, aimed at enhancing the generalization capabilities of robots in complex environments. Unlike traditional methods that focus on single-task learning, our research emphasizes the development of a general-purpose robotic coding algorithm that enables robots to leverage basic skills to tackle increasingly complex tasks. The newly proposed benchmark consists of 80 manually designed tasks across 7 distinct entities, testing the models' ability to learn from minimal initial mastery. Initial testing revealed that even advanced models like GPT-4 could only achieve a 47% pass rate in three-shot scenarios with humanoid entities. To address these limitations, the RoboCoder framework integrates LLMs with a dynamic learning system that uses real-time environmental feedback to continuously update and refine action codes. This adaptive method achieved a 36% relative improvement. Our code will be released.

Paper Structure (37 sections, 4 equations, 28 figures, 9 tables, 2 algorithms)

This paper contains 37 sections, 4 equations, 28 figures, 9 tables, 2 algorithms.

Figures (28)

  • Figure 1: Diverse Simulation Environments showcasing models used for testing robotic and biomechanical frameworks. (a) Human: a bipedal humanoid model; (b) Ant: a multi-legged robotic creature; (c) Cartpole: a classic control system with a pole balanced on a moving cart; (d) Sektion Cabinet: a static storage unit with drawable components; (e) Franka Panda: a robotic arm with articulation for intricate tasks; (f) Kinova: a modular robotic arm with advanced maneuverability; (g) Anymal: a quadruped robot designed for versatile mobility across varying terrain.
  • Figure 2: Searcher: When a target task is input, the searcher first computes the similarity between the target task and the entries in the action space's vector space. If the similarity exceeds the upper threshold, the target task is considered to exist already within the action space, and the corresponding action code is output. Actor: If the upper threshold is not exceeded, the searcher passes $k$ action codes that surpass the lower threshold to the actor, which then improves and updates these action codes to generate candidate actions that satisfy the target task. Evaluator: Finally, the evaluator examines the candidate actions; if they pass, the final action is output and added to the action space. If they do not pass, the evaluator provides a solution for the actor to revise the candidate action until it meets the evaluator's standards or the maximum number of iterations is reached.
  • Figure 3: The actor and evaluator system prompts.
  • Figure 4: The templates of the action code.
  • Figure 5: Process of evaluation. When an action code is tested in the target environment, if the environment returns 1, the system outputs the code along with error messages. If the environment returns 0, it outputs images of the action operation. The evaluator receives this information and determines whether the target task has been successfully completed. If it has not, a solution is provided to the actor for improvement.
  • ...and 23 more figures
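
The Searcher, Actor, and Evaluator loop described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Action`, `search`, and `solve`, the use of cosine similarity, the default threshold values, and the callable signatures for the actor and evaluator are all hypothetical. In the paper the actor and evaluator are LLM-driven; here they are plain Python callables so the control flow can be run on its own.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Vector = List[float]


@dataclass
class Action:
    task: str          # natural-language task description
    code: str          # executable action code
    embedding: Vector  # task embedding (the embedding model is assumed, not specified here)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; an assumed choice of similarity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(space: List[Action], query: Vector, upper: float, lower: float,
           k: int) -> Tuple[Optional[Action], List[Action]]:
    """Return (exact_hit, references), following the Searcher step in Figure 2."""
    scored = sorted(((cosine(a.embedding, query), a) for a in space),
                    key=lambda pair: pair[0], reverse=True)
    if scored and scored[0][0] > upper:
        return scored[0][1], []  # task already exists in the action space
    # Otherwise hand up to k codes above the lower threshold to the actor.
    return None, [a for s, a in scored if s > lower][:k]


# Hypothetical signatures: the actor maps (task, references, feedback) to code;
# the evaluator maps (task, code) to (passed, suggested_solution).
Actor = Callable[[str, List[Action], Optional[str]], str]
Evaluator = Callable[[str, str], Tuple[bool, Optional[str]]]


def solve(task: str, query: Vector, space: List[Action], actor: Actor,
          evaluator: Evaluator, upper: float = 0.9, lower: float = 0.5,
          k: int = 3, max_iters: int = 5) -> Optional[str]:
    hit, refs = search(space, query, upper, lower, k)
    if hit is not None:
        return hit.code
    feedback: Optional[str] = None
    for _ in range(max_iters):
        candidate = actor(task, refs, feedback)
        passed, feedback = evaluator(task, candidate)
        if passed:
            # Accepted code is added to the action space for later reuse.
            space.append(Action(task, candidate, query))
            return candidate
    return None  # iteration budget exhausted without an accepted action
```

In this sketch the evaluator callable would wrap the process from Figure 5: run the candidate in the environment, collect either error messages or rendered images depending on the return code, and turn that into a pass decision plus a suggested fix for the actor.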