
How Can LLM Guide RL? A Value-Based Approach

Shenao Zhang, Sirui Zheng, Shuqi Ke, Zhihan Liu, Wanxin Jin, Jianbo Yuan, Yingxiang Yang, Hongxia Yang, Zhaoran Wang

TL;DR

This work tackles the challenge of sample-inefficient RL by leveraging Large Language Models (LLMs) as a policy prior rather than as a direct decision-maker. It introduces LINVIT, a KL-regularized, value-based RL framework that uses the LLM to guide exploration and regularize value functions, and SLINVIT, a practical variant with subgoal-based planning to reduce search complexity. Theoretical analysis shows the sample complexity scales with the divergence between the optimal policy and the LLM policy, enabling near-optimal learning when the LLM is informative, while experiments across ALFWorld, InterCode, and BlocksWorld demonstrate state-of-the-art performance and strong data efficiency. The combination of theoretical guarantees and empirical validation suggests a practical path to efficient, LLM-guided reinforcement learning in complex, interactive tasks.

Abstract

Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback. However, RL algorithms may require extensive trial-and-error interactions to collect useful feedback for improvement. On the other hand, recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities for planning tasks, lacking the ability to autonomously refine their responses based on feedback. Therefore, in this paper, we study how the policy prior provided by the LLM can enhance the sample efficiency of RL algorithms. Specifically, we develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning, particularly when the difference between the ideal policy and the LLM-informed policy is small, which suggests that the initial policy is close to optimal, reducing the need for further exploration. Additionally, we present a practical algorithm SLINVIT that simplifies the construction of the value function and employs subgoals to reduce the search complexity. Our experiments across three interactive environments ALFWorld, InterCode, and BlocksWorld demonstrate that our method achieves state-of-the-art success rates and also surpasses previous RL and LLM approaches in terms of sample efficiency. Our code is available at https://github.com/agentification/Language-Integrated-VI.
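
To make "LLM guidance as a regularization factor" concrete, here is a minimal sketch of a generic KL-regularized greedy step for a single state. It is not the authors' LINVIT implementation. It only illustrates the standard fact that maximizing $\langle \pi, Q\rangle - \lambda\,\mathrm{KL}(\pi \Vert \pi^{\mathrm{LLM}})$ over $\pi$ gives $\pi(a) \propto \pi^{\mathrm{LLM}}(a)\exp(Q(a)/\lambda)$. The function names, the NumPy arrays standing in for the Q-values and the LLM prior, and the example numbers are all assumptions made for illustration.

```python
import numpy as np


def kl_regularized_policy(q_values, llm_prior, lam):
    """Maximizer of <pi, q> - lam * KL(pi || llm_prior) for one state.

    Closed form: pi(a) is proportional to llm_prior(a) * exp(q(a) / lam).
    Assumes llm_prior is strictly positive and lam > 0 (illustrative setup).
    """
    logits = np.log(llm_prior) + q_values / lam
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def kl_regularized_value(q_values, llm_prior, lam):
    """Optimal regularized objective: lam * log sum_a prior(a) * exp(q(a) / lam)."""
    z = np.log(llm_prior) + q_values / lam
    m = z.max()
    return lam * (m + np.log(np.exp(z - m).sum()))


if __name__ == "__main__":
    # Hypothetical Q-values and LLM prior over three actions.
    q = np.array([1.0, 0.0, 0.5])
    prior = np.array([0.2, 0.7, 0.1])
    for lam in (0.1, 1.0, 10.0):
        print(lam, kl_regularized_policy(q, prior, lam).round(3))
```

A small `lam` pushes the policy toward the greedy action under `q`, while a large `lam` keeps it close to the LLM prior. This is the sense in which the regularization weight trades off trust in the learned values against trust in the LLM.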

Paper Structure

This paper contains 26 sections, 12 theorems, 71 equations, 5 figures, 5 tables, 2 algorithms.

Key Result

Theorem 5.2

We assume that ${\mathrm{KL}}(\pi^*\Vert \pi^{\mathrm{LLM}})\leq \epsilon_{\mathrm{LLM}}$, and set the tuning parameter for some absolute constant $C$. We then have $V_1^*(s_1)-V_1^{\widehat{\pi}}(s_1)\leq \epsilon$ with probability at least $1-\delta$.

Figures (5)

  • Figure 1: Illustration of the differences between RL and LLM agents, and their respective advantages and disadvantages, in an instance of the ALFWorld decision-making task. We propose an RL framework leveraging the LLM as a policy prior that gets the best of both worlds.
  • Figure 2: Demonstration of the $\mathtt{SLINVIT}$ algorithm in the ALFWorld environment when $N=2$ and the tree breadth of BFS is set to $k=3$. The task is to "clean a cloth and put it on countertop". The LLM's hallucination, i.e., that the towel should be taken (instead of the cloth), is addressed by the inherent exploration mechanism in our RL framework.
  • Figure 3: Illustration of the proposed two instantiations of the value estimator.
  • Figure 4: Success rates with different numbers of samples.
  • Figure 5: Success rate (%) of $\mathtt{SLINVIT}$ and baselines on the 4-step and 6-step BlocksWorld tasks.

Theorems & Definitions (13)

  • Definition 5.1: KL-divergence between two policies
  • Theorem 5.2
  • Lemma A.1
  • Lemma A.2
  • Lemma B.1
  • Lemma B.2
  • Lemma B.3
  • Lemma B.4: Boundedness of $Q^\star_{\mathrm{LLM}, \lambda,h}$
  • Lemma B.5
  • Lemma B.6
  • ...and 3 more