Enhancing Q-Learning with Large Language Model Heuristics

Xiefeng Wu

TL;DR

This paper addresses the inefficiency and potential bias of Q-learning in complex environments by introducing LLM-guided Q-learning, which augments the Q-function with a heuristic term, $\hat{\mathbf{q}} = \mathbf{q} + \mathbf{h}$, to incorporate LLM-based guidance at the action level. It proposes offline and online guidance variants, extends TD3 to LLM-TD3, and provides a theoretical framework showing contraction, convergence to the buffered optimum $\mathbf{q}^*_D$, and a sample complexity bound $n = O\left( \frac{|\mathcal{S}|^2}{2\epsilon^2} \ln\frac{2|\mathcal{S} \times \mathcal{A}|}{\delta} \right)$. The analysis covers suboptimality and the impact of hallucinations, namely overestimation and two forms of underestimation, and shows that the method is robust and can recover from incorrect guidance within a finite number of steps. Empirical results across eight Gymnasium tasks show that LLM-TD3 achieves strong sample efficiency and broad generalizability, reducing ineffective exploration and often outperforming baselines without task-specific hyperparameter tuning, while acknowledging limitations in highly complex tasks such as Humanoid. Overall, the work advances practical RL by using generative models as adaptable heuristics that can be controlled online and corrected interactively, enabling safer and faster learning in diverse environments.
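To get a feel for how the sample complexity bound quoted above scales, the expression inside the $O(\cdot)$ can be evaluated directly. The sketch below is only an illustration: the function name `sample_bound` and its parameters are hypothetical, $|\mathcal{S} \times \mathcal{A}|$ is taken to be $|\mathcal{S}| \cdot |\mathcal{A}|$, and any constants hidden by the big-O notation are ignored, so the result is an order-of-magnitude guide rather than an exact sample count.

```python
import math


def sample_bound(num_states, num_actions, eps, delta):
    """Evaluate |S|^2 / (2 eps^2) * ln(2 |S x A| / delta).

    Hypothetical helper: constants hidden by O(.) are ignored.
    """
    if eps <= 0 or not 0 < delta < 1:
        raise ValueError("need eps > 0 and 0 < delta < 1")
    pairs = num_states * num_actions  # |S x A| for finite S and A
    return num_states ** 2 / (2 * eps ** 2) * math.log(2 * pairs / delta)


if __name__ == "__main__":
    # Halving eps multiplies the bound by four; delta only enters logarithmically.
    for eps in (0.2, 0.1, 0.05):
        print(eps, round(sample_bound(10, 4, eps, 0.05)))
```

The quadratic dependence on $|\mathcal{S}|$ and on $1/\epsilon$ dominates, while the confidence level $\delta$ and the number of actions appear only inside the logarithm.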

Abstract

Q-learning excels in learning from feedback within sequential decision-making tasks but often requires extensive sampling to achieve significant improvements. While reward shaping can enhance learning efficiency, non-potential-based methods introduce biases that affect performance, and potential-based reward shaping, though unbiased, lacks the ability to provide heuristics for state-action pairs, limiting its effectiveness in complex environments. Large language models (LLMs) can achieve zero-shot learning for simpler tasks, but they suffer from low inference speeds and occasional hallucinations. To address these challenges, we propose LLM-guided Q-learning, a framework that leverages LLMs as heuristics to aid in learning the Q-function for reinforcement learning. Our theoretical analysis demonstrates that this approach adapts to hallucinations, improves sample efficiency, and avoids biasing final performance. Experimental results show that our algorithm is general, robust, and capable of preventing ineffective exploration.
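The claim that a heuristic need not bias final performance can be illustrated on a toy problem. The following is a minimal sketch, not the paper's LLM-TD3 algorithm: it assumes a hypothetical five-state chain MDP, a fixed buffer containing every transition, and a hand-written array `h` standing in for LLM-generated heuristic Q-values. Tabular Q-learning starts from $\hat{\mathbf{q}} = \mathbf{q} + \mathbf{h}$ with $\mathbf{q} = 0$ and is run with a helpful heuristic, a misleading one (overestimating a non-optimal action), and none at all.

```python
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9  # toy chain; action 0 = left, 1 = right


def step(s, a):
    """Deterministic chain: reward 1 for entering the last (terminal) state."""
    s2 = max(s - 1, 0) if a == 0 else s + 1
    done = s2 == N_STATES - 1
    return (1.0 if done else 0.0), s2, done


# Fixed buffer D with every transition (s, a, r, s2, done) from non-terminal states.
BUFFER = [(s, a, *step(s, a)) for s in range(N_STATES - 1) for a in range(N_ACTIONS)]


def q_learning(h, sweeps=500, alpha=0.5):
    """Tabular Q-learning on BUFFER, starting from q_hat = q + h with q = 0."""
    q_hat = np.zeros((N_STATES, N_ACTIONS)) + h
    q_hat[N_STATES - 1] = 0.0  # the terminal state carries no value
    for _ in range(sweeps):
        for s, a, r, s2, done in BUFFER:
            target = r + (0.0 if done else GAMMA * q_hat[s2].max())
            q_hat[s, a] += alpha * (target - q_hat[s, a])
    return q_hat


if __name__ == "__main__":
    none = np.zeros((N_STATES, N_ACTIONS))
    good = none.copy()
    good[:, 1] = 1.0  # hint: moving right looks valuable
    bad = none.copy()
    bad[:, 0] = 5.0   # "hallucination": moving left is overestimated
    for name, h in (("none", none), ("good", good), ("bad", bad)):
        print(name, np.round(q_learning(h), 3).tolist())
```

Because the Bellman update on a fixed buffer is a contraction, all three runs end at the same table in this toy setting. The heuristic only changes where learning starts, which is the intuition behind using it to steer early exploration; the sketch says nothing about sample efficiency in the continuous-control tasks the paper evaluates.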

Paper Structure

This paper contains 35 sections, 4 theorems, 34 equations, 5 figures, 8 tables, 2 algorithms.

Introduction
Related Work
Reward Shaping
LLM/VLM Agent
LLM-enhanced RL
LLM-guided Q-learning
Heuristic Q-learning Framework
Limitations of Action-Bonus Heuristics
Algorithm Implementation
Theoretical Analysis
Suboptimality Analysis
Impact of Hallucination
Underestimation on Non-Optimal Actions
Underestimation on Optimal Actions
Convergence Analysis
...and 20 more sections

Key Result

Theorem 1

Let $\hat{\mathbf{q}}$ be a contraction mapping defined in the metric space $(\mathcal{X},\|\cdot\|_{\infty})$, i.e., $\|\mathcal{B}_{D}(\hat{\mathbf{q}}) - \mathcal{B}_{D}(\hat{\mathbf{q}}')\|_{\infty} \leq \gamma \|\hat{\mathbf{q}} - \hat{\mathbf{q}}'\|_{\infty}$, where $\mathcal{B}_{D}$ is the Bellman operator for the sampled MDP $D$ and $\gamma$ is the discount factor. Since both $\hat{\mathbf{q}}$ and $\mathbf{q}$ are updated on the same MDP, we have the following equation:

Figures (5)

Figure 1: Framework of LLM-guided Q-learning. The update sources for Q-learning include both collected experience and LLM-generated heuristics.
Figure 2: Overview of the Online Guidance Q-learning framework. External guidance can be provided at any training step, offering heuristic Q-values that influence policy decisions and improve sample efficiency.
Figure 3: Results of the adaptability test at different periods. The red horizontal line in the fourth figure represents optimal performance. The results indicate that after receiving incorrect heuristic values, our algorithm quickly recovers to its original performance levels.
Figure 4: Experimental results of different control tasks. Our proposed LLM-TD3 is able to converge quickly on different types of control tasks.
Figure 7: Hyperparameters of SAC

Theorems & Definitions (9)

proof
Theorem 1: Contraction and Equivalence of $\hat{\mathbf{q}}$
proof
Theorem 2: Convergence Sample Complexity
proof
Lemma 1: Decomposition
proof
Lemma 2: Convergence Bound
proof
