RLingua: Improving Reinforcement Learning Sample Efficiency in Robotic Manipulations With Large Language Models

Liangliang Chen; Yutian Lei; Shiyu Jin; Ying Zhang; Liangjun Zhang

TL;DR

RLingua is a framework that leverages the internal knowledge of large language models (LLMs) to reduce the sample complexity of RL in robotic manipulations, and it provides a novel method for improving imperfect LLM-generated robot controllers with RL.

Abstract

Reinforcement learning (RL) has demonstrated its capability in solving various tasks but is notorious for its low sample efficiency. In this paper, we propose RLingua, a framework that leverages the internal knowledge of large language models (LLMs) to reduce the sample complexity of RL in robotic manipulations. To this end, we first present a method for extracting the prior knowledge of LLMs through prompt engineering, so that a preliminary rule-based robot controller for a specific task can be generated in a user-friendly manner. Although imperfect, the LLM-generated robot controller is used to produce action samples during rollouts with a decaying probability, thereby improving RL's sample efficiency. We employ TD3, a widely used RL baseline method, and modify the actor loss to regularize policy learning towards the LLM-generated controller. RLingua also provides a novel method for improving imperfect LLM-generated robot controllers with RL. We demonstrate that RLingua can significantly reduce the sample complexity of TD3 in four robot tasks of panda_gym and achieve high success rates in 12 sampled sparsely rewarded robot tasks in RLBench, where the standard TD3 fails. Additionally, we validate RLingua's effectiveness in real-world robot experiments through Sim2Real, demonstrating that the learned policies transfer effectively to real robot tasks. Further details about our work are available at our project website https://rlingua.github.io.
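
The two mechanisms named in the abstract (sampling actions from the LLM-generated controller with a decaying probability, and regularizing the actor loss towards that controller) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the exponential schedule and its parameters `p0` and `decay`, the toy one-dimensional dynamics, the squared-error form of the regularizer, and the weight `bc_weight`. The two lists mirror the LLM buffer and RL buffer described in the Figure 1 caption.

```python
import random


def llm_action_probability(step, p0=0.9, decay=0.999):
    """Hypothetical schedule: chance of using the LLM controller at a given step."""
    return p0 * (decay ** step)


def select_action(obs, step, llm_controller, rl_policy, rng):
    """Pick the LLM controller with decaying probability, else the RL policy."""
    if rng.random() < llm_action_probability(step):
        return llm_controller(obs), "llm"
    return rl_policy(obs), "rl"


def collect(num_steps, llm_controller, rl_policy, seed=0):
    """Roll out in a toy 1-D environment and split samples into two buffers."""
    rng = random.Random(seed)
    llm_buffer, rl_buffer = [], []
    obs = 0.0
    for step in range(num_steps):
        action, source = select_action(obs, step, llm_controller, rl_policy, rng)
        target = llm_buffer if source == "llm" else rl_buffer
        target.append((obs, action))
        obs = obs + action  # toy dynamics, assumed for illustration only
    return llm_buffer, rl_buffer


def regularized_actor_loss(q_values, policy_actions, llm_actions, bc_weight=1.0):
    """Assumed form: -mean(Q) plus a squared-error pull towards LLM actions."""
    rl_term = -sum(q_values) / len(q_values)
    bc_term = sum(
        (p - a) ** 2 for p, a in zip(policy_actions, llm_actions)
    ) / len(llm_actions)
    return rl_term + bc_weight * bc_term


if __name__ == "__main__":
    llm_buf, rl_buf = collect(
        1000,
        llm_controller=lambda obs: 0.1,   # stand-in for the rule-based controller
        rl_policy=lambda obs: -0.1,       # stand-in for the learned actor
    )
    print(len(llm_buf), len(rl_buf))
```

With a schedule like this, early rollouts are dominated by the LLM controller's actions and later rollouts by the learned policy, which is the qualitative behavior the abstract describes. The actual schedule and loss weights used by RLingua should be taken from the paper's pseudo-code section.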

Paper Structure (25 sections, 6 equations, 8 figures, 2 tables, 1 algorithm)

Introduction
Preliminaries
Reinforcement Learning
Problem Descriptions
Methods of RLingua
LLM Prompt Design
Prompt Design with Human Feedback
Prompt Design With A Code Template
Reinforcement Learning With the LLM Controller
Experiments
Simulations in the panda_gym Environment
Simulations in the RLBench Environment
Real Robot Experiments
Conclusions
Pseudo-code of RLingua With TD3
...and 10 more sections

Figures (8)

Figure 1: RLingua extracts the LLM's knowledge about robot motion to improve the sample efficiency of RL. (a) Motivation: LLMs do not need environment samples and are easy for non-experts to communicate with. However, the robot controllers generated directly by LLMs may have inferior performance. In contrast, RL can train robot controllers to achieve high performance, but at the cost of high sample complexity. (b) Framework: RLingua extracts the internal knowledge of LLMs about robot motion into an imperfect coded controller, which is then used to collect data by interacting with the environment. The robot control policy is trained with both the collected LLM demonstration data and the interaction data collected by the online training policy. The collected demonstration data end up in the LLM buffer $R_{\mathrm{LLM}}$ and are directly used in imitation learning. They influence RL indirectly by shaping the policy and thus changing the samples collected in the RL buffer $R_{\mathrm{RL}}$.
Figure 2: The framework of prompt design with human feedback. The task descriptions and code guidelines are prompted in sequence. The human feedback is provided after observing the preliminary LLM controller execution process on the robot.
Figure 3: The framework of prompt design with a code template.
Figure 4: The success rates of different tasks in panda_gym with respect to numbers of environment samples. The solid line is the mean success rate and the shaded region represents the minimum and maximum success rates, both evaluated with four different random seeds. The exponential moving average with a smoothing factor of 0.95 is applied to all curves.
Figure 5: The success rates of different tasks in RLBench with respect to numbers of environment samples. The solid line is the mean success rate and the shaded region represents the minimum and maximum success rates, both evaluated with four different random seeds. The exponential moving average with a smoothing factor of 0.95 is applied to all curves.
...and 3 more figures
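
The captions of Figures 4 and 5 mention an exponential moving average with a smoothing factor of 0.95. As a rough sketch of what such smoothing typically looks like, the function below uses the common convention `smoothed = s * previous + (1 - s) * value`, initialized with the first value. This convention and the function name `ema_smooth` are assumptions; the captions do not state which variant the authors used.

```python
def ema_smooth(values, smoothing=0.95):
    """Exponential moving average; the first output equals the first input."""
    smoothed = []
    last = None
    for value in values:
        if last is None:
            last = value
        else:
            last = smoothing * last + (1.0 - smoothing) * value
        smoothed.append(last)
    return smoothed


if __name__ == "__main__":
    # Hypothetical success rates logged at successive evaluation points.
    raw = [0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.0]
    print(ema_smooth(raw))
```

A factor as high as 0.95 makes the plotted curve lag the raw success rate noticeably, so a smoothed curve reaching a given level implies the raw rate reached it earlier.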
