Probing Implicit Bias in Semi-gradient Q-learning: Visualizing the Effective Loss Landscapes via the Fokker--Planck Equation

Shuyu Yin; Fei Wen; Peilin Liu; Tao Luo

TL;DR

The paper addresses the problem that semi-gradient Q-learning lacks an explicit loss function, hindering analysis of implicit bias. It constructs an effective loss landscape using the Fokker–Planck equation (Wang's potential landscape) from partial data to visualize training dynamics and compare negative semi-gradient and residual-gradient forces. The key finding is that global minima of the true Bellman loss can become saddle points in the effective landscape, and this bias persists in high-dimensional neural networks, as shown in both small 2D examples and more realistic grid-world/DQN settings. The authors propose a three-step approach to probe implicit bias and provide public code to reproduce the visualizations, highlighting implications for understanding and diagnosing bias in semi-gradient RL methods.
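For orientation, the standard form of Wang's potential landscape construction can be written as below. This is a generic sketch, not the paper's exact formulation: the force $F$, the constant isotropic diffusion coefficient $D$, and the Langevin form $d\theta = F(\theta)\,dt + \sqrt{2D}\,dW_t$ are notational assumptions introduced here. In this reading, $F$ would be the negative semi-gradient (or residual-gradient) force and $U$ the effective loss.

$$
\begin{aligned}
\frac{\partial P(\theta, t)}{\partial t} &= -\nabla \cdot \bigl( F(\theta)\, P(\theta, t) \bigr) + D\, \nabla^2 P(\theta, t), \\
U(\theta) &= -\ln P_{ss}(\theta), \\
F(\theta) &= -D\, \nabla U(\theta) + \frac{J_{ss}(\theta)}{P_{ss}(\theta)}, \qquad J_{ss} = F P_{ss} - D\, \nabla P_{ss},
\end{aligned}
$$

where $P_{ss}$ is the steady-state solution and $J_{ss}$ the steady-state probability flux. When $F$ is the negative gradient of an explicit loss, the flux term vanishes and $U$ is that loss scaled by $1/D$ (up to an additive constant); a non-gradient force such as the semi-gradient generally leaves a non-zero flux, which is why the effective landscape can differ from the loss landscape.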

Abstract

Semi-gradient Q-learning is applied in many fields, but due to the absence of an explicit loss function, studying its dynamics and implicit bias in the parameter space is challenging. This paper introduces the Fokker--Planck equation and employs partial data obtained through sampling to construct and visualize the effective loss landscape within a two-dimensional parameter space. This visualization reveals how the global minima in the loss landscape can transform into saddle points in the effective loss landscape, as well as the implicit bias of the semi-gradient method. Additionally, we demonstrate that saddle points, originating from the global minima in the loss landscape, still exist in the effective loss landscape under high-dimensional parameter spaces and neural network settings. This paper develops a novel approach for probing implicit bias in semi-gradient Q-learning.
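To make the two forces compared in the abstract concrete, here is a minimal sketch in Python with NumPy. It assumes a two-parameter model $Q(s, a; \theta) = \theta_a\, \phi(s)$ with one parameter per action, which is inferred from the form of the solutions in the paper's lemma rather than stated in this summary. The feature values in `PHI`, the discount `GAMMA`, the reward of 1.0, and the function names are all hypothetical choices for illustration.

```python
import numpy as np

GAMMA = 0.9  # hypothetical discount factor
# Hypothetical scalar state features phi(s); values are illustrative only.
PHI = {1: 1.0, 2: 0.5, 3: 2.0}
# Mini-batch of transitions (s, a, s_next, r); actions are indexed 0 and 1.
BATCH = [(1, 0, 2, 1.0), (1, 1, 3, 1.0)]


def q_value(theta, s, a):
    """Assumed model: Q(s, a; theta) = theta[a] * phi(s)."""
    return theta[a] * PHI[s]


def grad_q(theta, s, a):
    """Gradient of Q(s, a; theta) with respect to theta."""
    g = np.zeros_like(theta)
    g[a] = PHI[s]
    return g


def greedy_action(theta, s):
    """Action maximizing Q(s, .; theta)."""
    return int(np.argmax([q_value(theta, s, b) for b in range(len(theta))]))


def bellman_loss(theta, batch=BATCH, gamma=GAMMA):
    """Half the sum of squared Bellman optimality residuals over the batch."""
    total = 0.0
    for s, a, s_next, r in batch:
        target = r + gamma * q_value(theta, s_next, greedy_action(theta, s_next))
        total += 0.5 * (q_value(theta, s, a) - target) ** 2
    return total


def forces(theta, batch=BATCH, gamma=GAMMA):
    """Return (negative semi-gradient, negative residual gradient) at theta."""
    semi = np.zeros_like(theta)
    resid = np.zeros_like(theta)
    for s, a, s_next, r in batch:
        a_star = greedy_action(theta, s_next)
        delta = q_value(theta, s, a) - r - gamma * q_value(theta, s_next, a_star)
        # Semi-gradient: the bootstrapped target is treated as a constant.
        semi -= delta * grad_q(theta, s, a)
        # Residual gradient: the target is differentiated as well.
        resid -= delta * (grad_q(theta, s, a) - gamma * grad_q(theta, s_next, a_star))
    return semi, resid


if __name__ == "__main__":
    theta0 = np.array([-2.0, 1.0])
    semi_force, resid_force = forces(theta0)
    print("negative semi-gradient:    ", semi_force)
    print("negative residual gradient:", resid_force)
```

Away from ties in the greedy action, the residual-gradient force is the exact negative gradient of `bellman_loss`, whereas the semi-gradient force is in general not the gradient of any explicit loss, which is the gap the effective loss landscape is meant to fill.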

Paper Structure (12 sections, 3 theorems, 25 equations, 16 figures)

Introduction
Related Works
Effective Loss Landscape Visualization and Implicit Bias Demonstration
Setting and one solution scenario
Effective loss landscapes with two solutions
Divergence and implicit bias of the semi-gradient method
Implicit Bias of the Semi-gradient Method with More Realistic Data
Analyze the Transition of Global Minimum to Saddle Point in $\mathbb{R}^2$
Conclusion and Discussion
Wang's Potential Landscape Theory
Proof of Theorems
Additional Figures

Key Result

Lemma 5.2

Suppose Assumption assum:mdpAssum holds. Then the solution for the Bellman optimal loss with policy $\pi_1$ is $\theta_{\pi_1}(a_1) = \frac{C}{\phi(s_{\alpha}) - \gamma \phi(s'_{\alpha})}$ and $\theta_{\pi_1}(a_2) = \frac{C}{\phi(s_{\beta})} + \frac{C\gamma \phi(s'_{\beta})}{\phi(s_{\beta}) \Bigl( \phi(s

Figures (16)

Figure 1: Training dynamics of the two methods, starting from the points $(-2, 1)$ (red) and $(-2, 3)$ (blue), in the landscape constructed from the data $\{(s_1, a_1, s_2, r), (s_1, a_2, s_3, r)\}$ of Example 3.1. In (a), both trajectories converge to $\theta_{\pi_1}$. In (b), by contrast, all trajectories are biased against $\theta_{\pi_2}$.
Figure 2: The geometric structure of the given MDP environment.
Figure 3: Loss landscapes for the mini-batch $\{(s_1, a_1, s_2, r), (s_2, a_2, s_4, r)\}$, for which only $\theta_{\pi_2}$ exists. In (a), the region surrounding the global minimum has a "heart" shape and is not smooth. In (b), the effective loss landscape is smooth. Apart from this difference in smoothness, the two landscapes have similar shapes.
Figure 4: Loss landscapes for the mini-batch $\{(s_1, a_1, s_2, r), (s_1, a_2, s_3, r)\}$, for which both $\theta_{\pi_1}$ (orange star) and $\theta_{\pi_2}$ (blue star) exist. In (a), the two exact solutions are both global minima. In (b), however, $\theta_{\pi_1}$ is a global minimum but $\theta_{\pi_2}$ is a saddle point.
Figure 5: Loss landscapes for the mini-batch $\{(s_2, a_1, s_1, r), (s_2, a_2, s_4, r)\}$, for which both $\theta_{\pi_1}$ (orange star) and $\theta_{\pi_2}$ (blue star) exist. Compared with Figure 4, the saddle point shifts from $\theta_{\pi_2}$ to $\theta_{\pi_1}$.
...and 11 more figures

Theorems & Definitions (13)

Example 3.1: example for visualization
Remark 3.2
Example 4.1: a grid world environment
Lemma 5.2: existence of solution
Remark 5.3
Lemma 5.4: smoothness for effective loss landscape
Remark 5.5
Theorem 5.6: implicit bias of semi-gradient
Remark 5.7
Remark 5.8
...and 3 more
