Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies

Haanvid Lee; Tri Wahyu Guntara; Jongmin Lee; Yung-Kyun Noh; Kee-Eung Kim

TL;DR

This work tackles off-policy evaluation (OPE) for deterministic policies in continuous-action RL by introducing KMIFQE, which relaxes the deterministic target policy with a kernel and learns a local Mahalanobis metric that minimizes the mean squared error of the TD update. The authors derive the bias and variance of the resulting estimate, obtain a closed-form optimal bandwidth $h^*$ and a closed-form optimal metric $A^*$, and prove an error bound relating the Bellman operator of the stochastically relaxed policy to that of the deterministic target. Empirically, KMIFQE is more accurate than SR-DICE and FQE on OpenAI Gym Pendulum, MuJoCo, and D4RL datasets, particularly when the action space is high-dimensional or contains noisy dummy dimensions. Together, the theoretical guarantees and the practical bandwidth and metric learning suggest that KMIFQE can support stable in-sample OPE for deterministic policies in real-world continuous-control domains.

Abstract

We consider off-policy evaluation (OPE) of deterministic target policies for reinforcement learning (RL) in environments with continuous action spaces. While it is common to use importance sampling for OPE, it suffers from high variance when the behavior policy deviates significantly from the target policy. To address this issue, some recent works on OPE have proposed in-sample learning with importance resampling. Yet these approaches are not applicable to deterministic target policies in continuous action spaces. To address this limitation, we propose to relax the deterministic target policy using a kernel and to learn the kernel metrics that minimize the overall mean squared error of the estimated temporal difference update vector of an action value function, where the action value function is used for policy evaluation. We derive the bias and variance of the estimation error due to this relaxation and provide analytic solutions for the optimal kernel metric. In empirical studies on various test domains, we show that OPE with in-sample learning using the kernel with the optimized metric achieves significantly better accuracy than the baselines.
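The kernel relaxation and importance resampling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel, the unit-determinant constraint on the metric, the self-normalized weights, and all function names (`mahalanobis_gaussian_kernel`, `resampling_probabilities`, `resample_indices`) are assumptions made for illustration only.

```python
import math
import random


def mahalanobis_gaussian_kernel(action, target_action, metric, bandwidth):
    """Gaussian kernel density at `action`, centred on `target_action`.

    Assumption: `metric` is a symmetric positive-definite d x d matrix
    (list of lists) with unit determinant, so the density integrates to one.
    """
    d = len(action)
    diff = [a - t for a, t in zip(action, target_action)]
    # Squared Mahalanobis distance: diff^T * metric * diff.
    sq_dist = sum(
        diff[i] * metric[i][j] * diff[j] for i in range(d) for j in range(d)
    )
    norm = (2.0 * math.pi) ** (d / 2.0) * bandwidth ** d
    return math.exp(-0.5 * sq_dist / bandwidth ** 2) / norm


def resampling_probabilities(actions, target_actions, behavior_densities,
                             metric, bandwidth):
    """Self-normalised ratios K(a - pi(s)) / mu(a | s) over a batch.

    `behavior_densities` holds mu(a_i | s_i), assumed known or estimated.
    """
    weights = [
        mahalanobis_gaussian_kernel(a, t, metric, bandwidth) / mu
        for a, t, mu in zip(actions, target_actions, behavior_densities)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def resample_indices(probabilities, num_samples, seed=0):
    """Draw in-sample indices in proportion to the resampling probabilities."""
    rng = random.Random(seed)
    return rng.choices(range(len(probabilities)), weights=probabilities,
                       k=num_samples)


# Two logged actions in a 2-D action space; the second dimension is a dummy.
actions = [[0.0, 1.0], [1.0, 0.0]]         # first differs only in the dummy dim
targets = [[0.0, 0.0], [0.0, 0.0]]         # deterministic target actions pi(s)
mu = [0.5, 0.5]                            # hypothetical behavior densities

isotropic = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[4.0, 0.0], [0.0, 0.25]]      # unit determinant, discounts dummy dim

probs_iso = resampling_probabilities(actions, targets, mu, isotropic, 1.0)
probs_metric = resampling_probabilities(actions, targets, mu, stretched, 1.0)
batch = resample_indices(probs_metric, num_samples=8)
```

With the isotropic metric both logged actions receive the same resampling probability, whereas the stretched metric favors the action that deviates from the target only along the dummy dimension. This is the intuition behind learning the metric rather than fixing it.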

Paper Structure (51 sections, 11 theorems, 34 equations, 2 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Background
In-sample TD learning
Kernel metric learning
Kernel metric learning for in-sample TD learning
MSE derivation
Optimal bandwidth and metric
Error Bound Analysis
Experiments
Pendulum with dummy action dimensions
Continuous control tasks with a known behavior policy
Continuous control tasks with unknown multiple behavior policies
Conclusion
Acknowledgements
...and 36 more sections

Key Result

Theorem 1

Under Assumptions assump:support through assump:Q_twice_diff, the bias and variance of $\widehat{\Delta}_{IR}^K$ are: […] where $\nabla^2_{\mathbf{a}'}$ is the Laplacian operator w.r.t. $\mathbf{a}'$, $\operatorname{Var}[\mathbf{z}] := \operatorname{tr}[\operatorname{Cov}(\mathbf{z}, \mathbf{z})]$ for a vector $\mathbf{z}$, …
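The scalar variance of a vector used in the theorem, $\operatorname{Var}[\mathbf{z}] := \operatorname{tr}[\operatorname{Cov}(\mathbf{z}, \mathbf{z})]$, is the quantity that combines with the squared bias norm to give the mean squared error of a vector estimate. The sketch below checks this standard identity numerically on toy data; the function names and the sample values are hypothetical and are not taken from the paper.

```python
def mean_vector(samples):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(z[k] for z in samples) / n for k in range(len(samples[0]))]


def total_variance(samples):
    """tr[Cov(z, z)]: the sum of per-component (population) variances."""
    n = len(samples)
    mean = mean_vector(samples)
    return sum(
        sum((z[k] - mean[k]) ** 2 for z in samples) / n
        for k in range(len(mean))
    )


def squared_bias(samples, target):
    """||E[z] - target||^2 for a fixed target vector."""
    mean = mean_vector(samples)
    return sum((m - t) ** 2 for m, t in zip(mean, target))


def mean_squared_error(samples, target):
    """E||z - target||^2, averaged over the samples."""
    n = len(samples)
    return sum(
        sum((zk - tk) ** 2 for zk, tk in zip(z, target)) for z in samples
    ) / n


# Toy estimates of a 2-D update vector and a hypothetical true value.
estimates = [[1.0, 2.0], [3.0, 6.0]]
truth = [0.0, 0.0]

var = total_variance(estimates)            # tr[Cov(z, z)]
bias_sq = squared_bias(estimates, truth)   # ||bias||^2
mse = mean_squared_error(estimates, truth)
```

On this toy data the mean squared error equals the squared bias norm plus the trace of the covariance, which is the sense in which a single scalar variance is enough when trading bias against variance.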

Figures (2)

Figure 1: (a) Empirical bias and variance of KMIFQE with and without metric learning as the number of dummy action dimensions increases. (b) Performance of KMIFQE with and without metric learning under various given bandwidths. The bandwidths learned by KMIFQE are plotted as vertical lines, along with markers indicating the MSEs. The shaded area is the region within one standard error. All experiments are repeated for 10 trials.
Figure 2: Visualization of $Q^\pi$ estimated by KMIFQE and FQE, along with the kernel metrics learned by KMIFQE in the modified Pendulum-v0 domain with one dummy action dimension. The original action dimension is $a_1$, and the dummy action dimension is $a_2$. The leftmost column illustrates the given states. The center column shows the Q-landscapes and metrics (black crosses) learned by KMIFQE. The rightmost column shows the Q-landscapes learned by FQE. The target actions at the given states are shown as yellow circles.

Theorems & Definitions (16)

Theorem 1
Corollary 1
Proposition 1
Proposition 2
Proposition 3
Theorem 2
Theorem \ref{thm:bias_var}
proof
Corollary \ref{corollary:MSE}
proof
...and 6 more
