Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents

Zihao Zhou; Bin Hu; Chenyang Zhao; Pu Zhang; Bin Liu

TL;DR

The paper tackles the cost and inefficiency of using large language models (LLMs) for embodied sequential decision-making by introducing LLM4Teach, a policy-distillation framework that trains a lightweight student RL agent from an LLM-based teacher. The student initially mimics the teacher through distillation-style guidance and progressively shifts to learning from environment feedback, with an annealing schedule that decays the teacher's influence over training. Empirical results on MiniGrid and Habitat show that LLM4Teach achieves higher sample efficiency than strong RL baselines and often better final performance, while using a far smaller model and requiring no LLM interaction at test time. The resulting agents draw on LLM reasoning during training but operate independently at deployment, which makes them practical for edge devices; uncertainty-aware instructions from the teacher further improve data efficiency.

Abstract

Recent studies have uncovered the potential of Large Language Models (LLMs) in addressing complex sequential decision-making tasks through the provision of high-level instructions. However, LLM-based agents lack specialization in tackling specific target problems, particularly in real-time dynamic environments. Additionally, deploying an LLM-based agent in practical scenarios can be both costly and time-consuming. On the other hand, reinforcement learning (RL) approaches train agents that specialize in the target task but often suffer from low sampling efficiency and high exploration costs. In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task. We conducted experiments on challenging MiniGrid and Habitat environments, specifically designed for embodied AI research, to evaluate the effectiveness of our framework. The results clearly demonstrate that our approach achieves superior performance compared to strong baseline methods. Our code is available at https://github.com/ZJLAB-AMMI/LLM4Teach.

Paper Structure (39 sections, 4 equations, 7 figures, 3 tables, 1 algorithm)

Introduction
Related Work
LLM-based Agents
LLM Assisted RL
Learning from Teacher Agents
LLM4Teach
The LLM4Teach Framework
On the LLM-based Teacher Agent
Generating Uncertainty-aware Instructions Using LLM
On the Learning Process of the Student Agent
Experiments
Simulation Platforms
MiniGrid
Habitat
Baseline Methods
...and 24 more sections

Figures (7)

Figure 1: An illustration of our LLM4Teach framework using the MiniGrid environment as an exemplar. The LLM-based teacher agent responds to observations of the state provided by the environment by offering soft instructions. These instructions take the form of a distribution over a set of suggested actions. The student agent is trained to optimize two objectives simultaneously. The first one is to maximize the expected return, the same as in traditional RL algorithms. The other one is to encourage the student agent to follow the guidance provided by the teacher. As the student agent's expertise increases during the training process, the weight assigned to the second objective gradually decreases over time, reducing its reliance on the teacher.
Figure 2: An example of a prefix prompt and an interaction between the student agent and the LLM-based teacher agent for the task ColoredDoorKey. The prefix prompt consists of two blocks: the instruction block briefly introduces the target problem and the CoT reasoning process, and the example block provides one arbitrary example of the expected response format from the LLM.
Figure 3: The tested average returns (top row) and task completion success rates (bottom row) vs. the training iteration index of the compared methods across four environments. The dotted vertical line indicates the point at which the teacher's guidance is fully withdrawn, i.e., when $\lambda_i = 0$. LLM soly does not involve any learning, hence we report its average performance over 500 testing seeds, represented by a dashed horizontal line. For other approaches, we evaluate their policies every 10 iterations with 10 randomly generated testing seeds and report the averaged testing performance here. With our approach, the student agent effectively leverages the knowledge of the LLM-based teacher to bootstrap the early learning stage. Except for the SimpleDoorKey task, the student agent in LLM4Teach ultimately outperforms the LLM-based agent by learning from environment feedback through minimizing a traditional RL loss.
Figure 4: Ablation study on uncertainty-aware instructions. Both types of uncertainty-aware instructions from the teacher improve the student agent's sample efficiency.
Figure 5: Habitat environment. Left: The visual observation from the onboard camera. Right: A view of the acting robot and its workspace from a third-party camera. Note that the third-party camera mentioned is purely for illustrative purposes and is not utilized during either the training or testing phases.
...and 2 more figures
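
The two-objective training described in the Figure 1 caption, with the teacher weight reaching zero as noted in the Figure 3 caption, can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the linear decay schedule, the use of KL(teacher || student) as the "follow the teacher" term, and all names (`teacher_weight`, `kl_divergence`, `combined_loss`, `initial_weight`, `cutoff_iteration`) are assumptions introduced here. The RL loss is passed in as a plain number standing in for whatever RL algorithm trains the student.

```python
import math


def teacher_weight(iteration, initial_weight=1.0, cutoff_iteration=100):
    """Weight on the teacher-following term at a given training iteration.

    Assumed schedule: linear decay from initial_weight to zero at
    cutoff_iteration, and zero afterwards.
    """
    if iteration >= cutoff_iteration:
        return 0.0
    return initial_weight * (1.0 - iteration / cutoff_iteration)


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over a discrete action set.

    Actions the teacher assigns zero probability contribute nothing.
    The student probability is clipped at eps to avoid log of zero.
    """
    total = 0.0
    for p_teacher, p_student in zip(teacher_probs, student_probs):
        if p_teacher > 0.0:
            total += p_teacher * math.log(p_teacher / max(p_student, eps))
    return total


def combined_loss(rl_loss, teacher_probs, student_probs, iteration,
                  initial_weight=1.0, cutoff_iteration=100):
    """RL loss plus the annealed teacher-following penalty (assumed form)."""
    weight = teacher_weight(iteration, initial_weight, cutoff_iteration)
    return rl_loss + weight * kl_divergence(teacher_probs, student_probs)


if __name__ == "__main__":
    # Hypothetical soft instruction: the teacher splits its mass over two actions.
    teacher = [0.5, 0.5, 0.0]
    student = [0.25, 0.25, 0.5]
    for i in (0, 50, 100):
        print(i, teacher_weight(i), combined_loss(2.0, teacher, student, i))
```

Once the weight reaches zero, the combined loss reduces to the RL loss alone, which corresponds to the phase in which the student learns only from environment feedback.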
