Efficient Reinforcement Learning with Large Language Model Priors

Xue Yan, Yan Song, Xidong Feng, Mengyue Yang, Haifeng Zhang, Haitham Bou Ammar, Jun Wang

TL;DR

Experiments show that incorporating LLM-based action priors significantly reduces exploration and optimization complexity and substantially improves sample efficiency over traditional RL techniques. For example, using LLM priors decreases the number of required samples by over 90% in offline learning scenarios.

Abstract

In sequential decision-making (SDM) tasks, methods like reinforcement learning (RL) and heuristic search have made notable advances in specific cases. However, they often require extensive exploration and struggle to generalize across diverse environments because of their limited grasp of the underlying decision dynamics. In contrast, large language models (LLMs) have recently emerged as powerful general-purpose tools, owing to their capacity to retain vast amounts of domain-specific knowledge. To harness this rich prior knowledge for efficiently solving complex SDM tasks, we propose treating LLMs as prior action distributions and integrating them into RL frameworks through Bayesian inference methods, making use of variational inference and direct posterior sampling. The proposed approaches facilitate the seamless incorporation of fixed LLM priors into both policy-based and value-based RL frameworks. Our experiments show that incorporating LLM-based action priors significantly reduces exploration and optimization complexity, substantially improving sample efficiency compared to traditional RL techniques. For example, using LLM priors decreases the number of required samples by over 90% in offline learning scenarios.

Paper Structure

This paper contains 44 sections, 2 theorems, 27 equations, 4 figures, and 11 tables.

Key Result

Proposition 1

Let $q$ denote the distribution that the above sampling strategy follows. As $k\rightarrow \infty$, $q$ converges to a limiting policy, and this limiting policy corresponds to the policy that optimizes the Q-values with a KL regularizer. The posterior sampling strategy is therefore closely related to the solution of variational inference as shown in Eq. variation. Proof. Please see the appendix.
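
For orientation, the two relations the proposition refers to can be sketched in the standard KL-regularized form. This is a sketch under assumptions rather than the proposition's own equations: the temperature $\alpha > 0$, the softmax reweighting of the $k$ prior proposals by $Q(s,a)/\alpha$, and the normalization are not taken from the text above.

$$
\lim_{k\rightarrow \infty} q(a|s) = \frac{p_\text{LLM}(a|s)\,\exp\big(Q(s,a)/\alpha\big)}{\mathbb{E}_{a'\sim p_\text{LLM}(\cdot|s)}\big[\exp\big(Q(s,a')/\alpha\big)\big]}
$$

$$
q^{*}(\cdot|s) = \arg\max_{q}\; \mathbb{E}_{a\sim q(\cdot|s)}\big[Q(s,a)\big] - \alpha\,\mathrm{KL}\big(q(\cdot|s)\,\|\,p_\text{LLM}(\cdot|s)\big)
$$

Under these assumptions the two expressions coincide: the maximizer of the KL-regularized objective is the prior reweighted by $\exp(Q/\alpha)$ and renormalized, which is the same form as the posterior $q(a|s,\mathcal{O}=1) \propto p(\mathcal{O}=1|a,s)\,p_\text{LLM}(a|s)$ when $\exp(Q/\alpha)$ plays the role of the likelihood.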

Figures (4)

  • Figure 1: An illustration of approximate sampling from the intractable posterior $q(a | s_t, \mathcal{O} = 1) \propto p(\mathcal{O}=1|a,s_t)p_\text{LLM}(a|s_t)$, implemented by reweighting the action prior proposals according to $Q$-values, which act as likelihood estimates. In our experiments, the $Q$ function uses BERT to encode textual state-action pairs and outputs a scalar value through an adaptor network.
  • Figure 2: Comparison with online baselines. We plot the mean and standard error of the cumulative reward. For inference baselines, the reward is averaged over $20$ episodes. For trainable baselines, we plot the reward averaged over the final third of training, across five random seeds.
  • Figure 3: (a) Comparison of policy-based RL algorithms, with DQN-Prior and LLM-Prior plotted for reference. (b) Ablation of the number of LLM action proposals $k$ used to approximate the KL divergence. (c) Ablation of the KL coefficient of GFlan-Prior. (d) Ablation of the softmax temperature over Q-values in DQN-Prior.
  • Figure 4: The ablation of the number of action proposals $k$ used for approximating KL divergence between the required action policy and the LLM prior action distribution.
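
The reweighting step described in Figure 1, together with the softmax temperature ablated in Figure 3(d), can be sketched on a toy discrete problem. This is a minimal illustration, not the authors' implementation: the tabular `prior_probs` and `q_values` arrays stand in for the LLM prior and the BERT-based $Q$ function, and the function names, the `temperature` parameter, and the toy numbers are all assumptions made for the example.

```python
import numpy as np


def sample_posterior_action(prior_probs, q_values, k, temperature, rng):
    """Draw k proposals from the prior, reweight them by softmax(Q / temperature),
    and return one of the proposals sampled from those weights."""
    # Step 1: k action proposals from the (toy) prior distribution.
    proposals = rng.choice(len(prior_probs), size=k, p=prior_probs)
    # Step 2: softmax over the Q-values of the proposals (shifted for stability).
    logits = q_values[proposals] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Step 3: resample a single action according to the weights.
    return int(proposals[rng.choice(k, p=weights)])


def limiting_policy(prior_probs, q_values, temperature):
    """Prior reweighted by exp(Q / temperature) and renormalized: the
    distribution the sampler above approaches as k grows (assumed form)."""
    unnorm = prior_probs * np.exp((q_values - q_values.max()) / temperature)
    return unnorm / unnorm.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prior_probs = np.array([0.5, 0.3, 0.2])  # hypothetical prior over 3 actions
    q_values = np.array([0.0, 1.0, 2.0])     # hypothetical Q-values
    n_samples = 4000
    for k in (1, 4, 64):
        draws = [sample_posterior_action(prior_probs, q_values, k, 1.0, rng)
                 for _ in range(n_samples)]
        freq = np.bincount(draws, minlength=3) / n_samples
        print(f"k={k:>2} empirical action frequencies: {np.round(freq, 3)}")
    print("limiting policy:", np.round(limiting_policy(prior_probs, q_values, 1.0), 3))
```

With `k = 1` the sampler simply returns a draw from the prior. As `k` grows, the empirical frequencies move toward the reweighted distribution, so `k` trades the number of prior (LLM) calls against how closely the sampler tracks the posterior.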

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 1
  • proof