From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with LLM-Guided Knowledge

Xiefeng Wu

TL;DR

Q-shaping is presented as an unbiased alternative to conventional reward shaping in reinforcement learning that significantly improves sample efficiency.

Abstract

Q-shaping is an extension of Q-value initialization and an alternative to reward shaping for incorporating domain knowledge into agent training. By shaping Q-values directly, it accelerates training and improves sample efficiency. The approach is general and robust across diverse tasks, allows the impact of a heuristic to be assessed immediately, and guarantees optimality. We evaluated Q-shaping across 20 different environments using a large language model (LLM) as the heuristic provider. The results show that Q-shaping significantly improves sample efficiency, achieving a 16.87% improvement over the best baseline in each environment and a 253.80% improvement over LLM-based reward shaping methods. These findings establish Q-shaping as a superior and unbiased alternative to conventional reward shaping in reinforcement learning.
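To illustrate why starting from heuristic Q-values need not bias the final solution, the sketch below runs tabular Q-value iteration on a small chain MDP from two starting points: an all-zero table and a deliberately imprecise heuristic table. This is not the paper's Q-shaping algorithm. The four-state MDP, the function names, and the constants are all hypothetical and chosen only for illustration. The sketch relies on the standard fact that the Bellman optimality operator is a γ-contraction in the sup norm, so both starting tables reach the same fixed point.

```python
# Hypothetical toy MDP (an assumption for illustration, not from the paper):
# states 0..3 on a chain, state 3 is terminal; action 0 = left, 1 = right.
GAMMA = 0.9
N_STATES, N_ACTIONS, GOAL = 4, 2, 3


def step(s, a):
    """Return (next_state, reward) for the deterministic toy chain."""
    nxt = max(s - 1, 0) if a == 0 else s + 1
    return nxt, (1.0 if nxt == GOAL else 0.0)


def bellman_backup(q):
    """Apply the Bellman optimality operator once to a Q-table."""
    new_q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for s in range(GOAL):  # the terminal state's Q-values stay at 0
        for a in range(N_ACTIONS):
            nxt, r = step(s, a)
            future = 0.0 if nxt == GOAL else max(q[nxt])
            new_q[s][a] = r + GAMMA * future
    return new_q


def sup_dist(q1, q2):
    """Sup-norm distance between two Q-tables."""
    return max(abs(x - y) for r1, r2 in zip(q1, q2) for x, y in zip(r1, r2))


def iterate(q, sweeps):
    """Apply the Bellman backup repeatedly."""
    for _ in range(sweeps):
        q = bellman_backup(q)
    return q


# Two initializations: zeros, and an overestimating heuristic that favors "right".
q_zero = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
q_heur = [[0.0, 5.0] for _ in range(N_STATES)]

# One backup shrinks the gap between the tables by at least a factor of GAMMA.
gap_before = sup_dist(q_zero, q_heur)
gap_after = sup_dist(bellman_backup(q_zero), bellman_backup(q_heur))

# After many backups both tables agree, regardless of initialization.
q_from_zero = iterate(q_zero, 300)
q_from_heur = iterate(q_heur, 300)
final_gap = sup_dist(q_from_zero, q_from_heur)
```

In this toy setting the heuristic only changes where the iteration starts, not where it ends. Whether a good heuristic also speeds up learning in practice depends on the sampled data and the learning algorithm, which this sketch does not model.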

Paper Structure

This paper contains 28 sections, 2 theorems, 6 equations, 7 figures, 2 tables, and 1 algorithm.

Introduction
Related Work
Heuristic Reinforcement Learning
LLM/VLM Agent
LLM-enhanced RL
Notation
Markov Decision Processes.
Datasets
Convergence
Q-shaping Framework
Unbiased Optimality
Utilizing Imprecise Q value Estimation
Underestimation of Non-Optimal Actions
Overestimation of Near-Optimal Actions
Algorithm Implementation
...and 13 more sections

Key Result

Theorem 1

Let $\hat{\mathbf{q}}$ be a contraction mapping defined in the metric space $(\mathcal{X},\|\cdot\|_{\infty})$, i.e., $\|\mathcal{B}_{\mathcal{D}}(\hat{\mathbf{q}}) - \mathcal{B}_{\mathcal{D}}(\hat{\mathbf{q}}')\|_{\infty} \leq \gamma \|\hat{\mathbf{q}} - \hat{\mathbf{q}}'\|_{\infty}$, where $\mathcal{B}_{\mathcal{D}}$ is the Bellman operator for the sampled MDP $\mathcal{D}$ and $\gamma$ is the discount factor. Since both $\hat{\mathbf{q}}$ and $\mathbf{q}$ are updated on the same MDP, we have the following equation:

Figures (7)

Figure 1: Agent behavior across different algorithms. Q-shaping affects agent behavior quickly, so heuristic functions can be evaluated and improved rapidly. "Vanilla" refers to traditional RL algorithms. RL algorithms enhanced with reward shaping cannot affect agent behavior immediately and therefore have a slow verification period.
Figure 2: Q-shaping prompt. A general code template specifies the required structure of the generated code. Three pieces of information are needed to generate an effective heuristic function: the code template, the introduction to the environment provided in the paper, and the environment configuration file.
Figure 3: Evaluation Environments
Figure 4: Learning curve comparison of each algorithm across 20 tasks.
Figure 5: Q-shaping improvement over the best baseline in each environment and its improvement over TD3.
...and 2 more figures

Theorems & Definitions (4)

Theorem 1: Contraction and Equivalence of $\hat{\mathbf{q}}$
proof
Theorem 2: Convergence Sample Complexity
proof
