Enhancing Q-Learning with Large Language Model Heuristics

Xiefeng Wu

TL;DR

This paper addresses the sample inefficiency and potential bias of Q-learning in complex environments by introducing LLM-guided Q-learning, which augments the Q-function with a heuristic term, $\hat{\mathbf{q}} = \mathbf{q} + \mathbf{h}$, to incorporate LLM-based guidance at the action level. It proposes offline and online guidance variants, extends TD3 to LLM-TD3, and provides a theoretical framework showing contraction, convergence to the buffered optimum $\mathbf{q}^*_D$, and a sample complexity bound $n = O\left( \frac{|\mathcal{S}|^2}{2\epsilon^2} \ln\frac{2|\mathcal{S} \times \mathcal{A}|}{\delta} \right)$. The analysis covers suboptimality and the impact of hallucinations, in the form of overestimation and two forms of underestimation, and shows that the method is robust and can recover from incorrect guidance within a finite number of steps. Empirical results across eight Gymnasium tasks show that LLM-TD3 achieves strong sample efficiency and broad generalizability, reducing ineffective exploration and often outperforming baselines without task-specific hyperparameter tuning; the paper acknowledges limitations in highly complex tasks such as Humanoid. Overall, the work advances practical RL by using generative models as adaptable, online-controllable heuristics that can be corrected interactively, enabling safer and faster learning in diverse environments.
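To get a feel for how the sample complexity bound scales, the expression inside the $O(\cdot)$ can be evaluated directly. The sketch below is illustrative only: the function name `sample_bound` and the example sizes are assumptions, and the hidden constant of the $O(\cdot)$ is ignored, so the result indicates scaling rather than an exact sample count.

```python
import math


def sample_bound(num_states, num_actions, epsilon, delta):
    """Evaluate |S|^2 / (2 eps^2) * ln(2 |S x A| / delta).

    Hypothetical helper: the O(.) constant from the bound is dropped,
    so only the relative scaling is meaningful.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    pairs = num_states * num_actions  # |S x A|
    return num_states ** 2 / (2 * epsilon ** 2) * math.log(2 * pairs / delta)


# Example sizes are assumptions chosen for illustration.
base = sample_bound(num_states=10, num_actions=4, epsilon=0.1, delta=0.05)
tighter = sample_bound(num_states=10, num_actions=4, epsilon=0.05, delta=0.05)
print(f"bound at eps=0.10: {base:.0f}")
print(f"bound at eps=0.05: {tighter:.0f}")
```

Halving $\epsilon$ multiplies the expression by four, while $\delta$ and $|\mathcal{S} \times \mathcal{A}|$ enter only through the logarithm.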

Abstract

Q-learning excels in learning from feedback within sequential decision-making tasks but often requires extensive sampling to achieve significant improvements. While reward shaping can enhance learning efficiency, non-potential-based methods introduce biases that affect performance, and potential-based reward shaping, though unbiased, lacks the ability to provide heuristics for state-action pairs, limiting its effectiveness in complex environments. Large language models (LLMs) can achieve zero-shot learning for simpler tasks, but they suffer from low inference speeds and occasional hallucinations. To address these challenges, we propose LLM-guided Q-learning, a framework that leverages LLMs as heuristics to aid in learning the Q-function for reinforcement learning. Our theoretical analysis demonstrates that this approach adapts to hallucinations, improves sample efficiency, and avoids biasing final performance. Experimental results show that our algorithm is general, robust, and capable of preventing ineffective exploration.
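The idea of a heuristic-augmented Q-function, $\hat{\mathbf{q}} = \mathbf{q} + \mathbf{h}$, can be illustrated with a small tabular sketch. This is not the paper's LLM-TD3 algorithm: the 5-state chain environment, the fixed table `h` standing in for LLM output, the function name `train`, and the choice to apply the TD update to $\hat{\mathbf{q}}$ are all assumptions made for illustration.

```python
import numpy as np


def train(h, episodes=200, n_states=5, gamma=0.9, alpha=0.5, eps=0.1, seed=0):
    """Tabular Q-learning on q_hat = q + h over a toy chain (hypothetical setup).

    Actions: 0 = left, 1 = right. Reaching the last state ends the episode
    with reward 1; every other transition gives reward 0.
    """
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, 2))  # learned part; h stays fixed
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # step cap per episode
            q_hat = q + h
            # Epsilon-greedy action selection on the augmented values.
            if rng.random() < eps:
                a = int(rng.integers(2))
            else:
                a = int(np.argmax(q_hat[s]))
            s_next = max(s - 1, 0) if a == 0 else s + 1
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            # TD target and TD error are both computed on q_hat, so h
            # changes the starting point but not the fixed point.
            target = r if done else r + gamma * np.max(q[s_next] + h[s_next])
            q[s, a] += alpha * (target - (q[s, a] + h[s, a]))
            s = s_next
            if done:
                break
    return q + h


# Assumed heuristic: a mild preference for moving right in every state.
h = np.zeros((5, 2))
h[:, 1] = 0.5
q_hat = train(h)
# The terminal row (state 4) is never updated and just holds h.
print(np.round(q_hat[:4], 3))
```

Under these assumptions the heuristic steers early action selection toward the goal, while the learned values for the "right" action still approach the discounted returns of the chain ($\gamma^{3}, \gamma^{2}, \gamma, 1$), which do not depend on `h`.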

Paper Structure

This paper contains 35 sections, 4 theorems, 34 equations, 5 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

Let $\hat{\mathbf{q}}$ be a contraction mapping defined in the metric space $(\mathcal{X},\|\cdot\|_{\infty})$, i.e., $\|\mathcal{B}_{D}\hat{\mathbf{q}} - \mathcal{B}_{D}\hat{\mathbf{q}}'\|_{\infty} \leq \gamma \|\hat{\mathbf{q}} - \hat{\mathbf{q}}'\|_{\infty}$, where $\mathcal{B}_{D}$ is the Bellman operator for the sampled MDP $D$ and $\gamma$ is the discount factor. Since both $\hat{\mathbf{q}}$ and $\mathbf{q}$ are updated on the same MDP, we have the following equation:

Figures (5)

  • Figure 1: Framework of LLM-guided Q-learning. The update sources for Q-learning include both collected experience and LLM-generated heuristics.
  • Figure 2: Overview of the Online Guidance Q-learning framework. External guidance can be provided at any training step, offering heuristic Q-values that influence policy decisions and improve sample efficiency.
  • Figure 3: Results of the adaptability test at different periods. The red horizontal line in the fourth figure represents optimal performance. The results indicate that after receiving incorrect heuristic values, our algorithm quickly recovers to its original performance levels.
  • Figure 4: Experimental results of different control tasks. Our proposed LLM-TD3 is able to converge quickly on different types of control tasks.
  • Figure 7: Hyperparameters of SAC

Theorems & Definitions (9)

  • proof
  • Theorem 1: Contraction and Equivalence of $\hat{\mathbf{q}}$
  • proof
  • Theorem 2: Convergence Sample Complexity
  • proof
  • Lemma 1: Decomposition
  • proof
  • Lemma 2: Convergence Bound
  • proof