Enhancing Q-Learning with Large Language Model Heuristics
Xiefeng Wu
TL;DR
This paper addresses the inefficiency and potential bias of Q-learning in complex environments by introducing LLM-guided Q-learning, which augments the Q-function with a heuristic term, $\hat{\mathbf{q}} = \mathbf{q} + \mathbf{h}$, to incorporate LLM-based guidance at the action level. It proposes offline and online guidance variants, extends TD3 to LLM-TD3, and provides a theoretical framework showing contraction, convergence to the buffered optimum $\mathbf{q}^*_D$, and a sample complexity bound $n = O\left( \frac{|\mathcal{S}|^2}{2\epsilon^2} \ln\frac{2|\mathcal{S} \times \mathcal{A}|}{\delta} \right)$. The analysis covers suboptimality and the impact of hallucinations, modeled as overestimation and two forms of underestimation, and argues that the method is robust and can recover from incorrect guidance within a finite number of steps. Empirical results across eight Gymnasium tasks show that LLM-TD3 achieves strong sample efficiency and broad generalizability, reducing ineffective exploration and often outperforming baselines without task-specific hyperparameter tuning. The paper acknowledges limitations on highly complex tasks such as Humanoid. Overall, the work advances practical RL by using generative models as adaptable, online-controllable heuristics that can be corrected interactively, enabling safer and faster learning in diverse environments.
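To get a feel for how the sample complexity bound scales, the expression inside the $O(\cdot)$ can be evaluated directly. The sketch below is only an illustration: the function name `sample_bound` and the example sizes are assumptions, the constant hidden by the $O(\cdot)$ is ignored, and $|\mathcal{S} \times \mathcal{A}|$ is taken to be $|\mathcal{S}| \cdot |\mathcal{A}|$.

```python
import math


def sample_bound(num_states, num_actions, epsilon, delta):
    """Evaluate |S|^2 / (2 eps^2) * ln(2 |S x A| / delta).

    This is the expression inside the O(.) from the summary above;
    any hidden constant factor is not included.
    """
    if epsilon <= 0 or not (0 < delta < 1):
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    pairs = num_states * num_actions  # |S x A|
    return (num_states ** 2) / (2 * epsilon ** 2) * math.log(2 * pairs / delta)


if __name__ == "__main__":
    # Hypothetical small problem: 10 states, 4 actions.
    for eps in (0.2, 0.1, 0.05):
        print(eps, round(sample_bound(10, 4, eps, delta=0.05)))
```

Halving $\epsilon$ quadruples the value, while tightening $\delta$ only adds a logarithmic factor, so the accuracy target dominates the cost in this bound.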
Abstract
Q-learning excels in learning from feedback within sequential decision-making tasks but often requires extensive sampling to achieve significant improvements. While reward shaping can enhance learning efficiency, non-potential-based methods introduce biases that affect performance, and potential-based reward shaping, though unbiased, lacks the ability to provide heuristics for state-action pairs, limiting its effectiveness in complex environments. Large language models (LLMs) can achieve zero-shot learning for simpler tasks, but they suffer from low inference speeds and occasional hallucinations. To address these challenges, we propose LLM-guided Q-learning, a framework that leverages LLMs as heuristics to aid in learning the Q-function for reinforcement learning. Our theoretical analysis demonstrates that this approach adapts to hallucinations, improves sample efficiency, and avoids biasing final performance. Experimental results show that our algorithm is general, robust, and capable of preventing ineffective exploration.
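The idea of a state-action heuristic that later feedback can correct may be easier to see on a toy problem. The following is a minimal tabular sketch, not the paper's algorithm (which extends TD3): it assumes the heuristic is simply added to a zero-initialized Q-table before ordinary Q-learning updates run. The chain environment, the `heuristic` dictionary standing in for LLM output, and all hyperparameters are hypothetical.

```python
import random

N_STATES, GOAL = 5, 4      # states 0..4; state 4 is terminal
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic chain: reward 1 only when the goal is reached."""
    nxt = max(state - 1, 0) if action == LEFT else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def train(heuristic, episodes=2000, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning on q_hat = q + h, with q initialized to zero.

    `heuristic` maps (state, action) to a value and is a stand-in for
    LLM-provided guidance; missing pairs get no guidance (h = 0).
    """
    rng = random.Random(seed)
    q_hat = {(s, a): heuristic.get((s, a), 0.0)
             for s in range(N_STATES - 1) for a in (LEFT, RIGHT)}
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # step cap per episode
            if rng.random() < eps:
                a = rng.randrange(2)  # explore
            else:
                a = max((LEFT, RIGHT), key=lambda act: q_hat[(s, act)])
            s2, r, done = step(s, a)
            # No bootstrapping from the terminal state.
            nxt = 0.0 if done else max(q_hat[(s2, LEFT)], q_hat[(s2, RIGHT)])
            q_hat[(s, a)] += alpha * (r + gamma * nxt - q_hat[(s, a)])
            s = s2
            if done:
                break
    return q_hat


def greedy_policy(q_hat):
    """Greedy action for each non-terminal state."""
    return [max((LEFT, RIGHT), key=lambda a: q_hat[(s, a)])
            for s in range(N_STATES - 1)]


if __name__ == "__main__":
    helpful = {(s, RIGHT): 0.5 for s in range(N_STATES - 1)}
    misleading = {(s, LEFT): 1.0 for s in range(N_STATES - 1)}  # "hallucinated"
    print(greedy_policy(train(helpful)))
    print(greedy_policy(train(misleading)))
```

In this toy setting a misleading heuristic that overestimates the wrong action is selected greedily, receives no supporting reward, and decays under the temporal-difference updates, which mirrors the recovery behavior the abstract describes in informal terms.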
