From Weighting to Modeling: A Nonparametric Estimator for Off-Policy Evaluation

Rong J. B. Zhu

TL;DR

This work addresses the high variance of inverse probability weighting (IPW) by proposing a Nonparametric Weighting (NW) approach that constructs weights using a nonparametric model. Incorporating reward predictions, as in the doubly robust (DR) technique, yields the Model-assisted Nonparametric Weighting (MNW) approach.

Abstract

We study off-policy evaluation in the setting of contextual bandits, where we aim to evaluate a new policy using historical data that consists of contexts, actions, and received rewards. This historical data typically does not faithfully represent the action distribution of the new policy. A common approach, inverse probability weighting (IPW), adjusts for these discrepancies in action distributions. However, this method often suffers from high variance because the action probability appears in the denominator. The doubly robust (DR) estimator reduces variance through reward modeling but does not directly address the variance from IPW. In this work, we address this limitation of IPW by proposing a Nonparametric Weighting (NW) approach that constructs weights using a nonparametric model. Our NW approach achieves low bias like IPW but typically exhibits significantly lower variance. To further reduce variance, we incorporate reward predictions, similar to the DR technique, resulting in the Model-assisted Nonparametric Weighting (MNW) approach. The MNW approach yields accurate value estimates by explicitly modeling and mitigating bias from reward modeling, without aiming to guarantee the standard doubly robust property. Extensive empirical comparisons show that our approaches consistently outperform existing techniques, achieving lower variance in value estimation while maintaining low bias.
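For reference, the two baseline estimators named in the abstract can be sketched as follows. This is a minimal illustration of the standard IPW and DR estimators for contextual bandits, not the paper's NW or MNW method. The function names, the array layout (rows are logged samples, columns are actions), the known behavior-policy probabilities, and the synthetic data in the demo are all assumptions made for illustration.

```python
import numpy as np


def ipw_value(rewards, actions, behavior_probs, target_probs):
    """IPW estimate: mean of r_i * pi(a_i | x_i) / mu(a_i | x_i)."""
    idx = np.arange(len(actions))
    # The behavior probability sits in the denominator, so rarely logged
    # actions receive large weights; this is the source of high variance.
    weights = target_probs[idx, actions] / behavior_probs[idx, actions]
    return float(np.mean(weights * rewards))


def dr_value(rewards, actions, behavior_probs, target_probs, reward_model):
    """DR estimate: model-based value plus an IPW-weighted residual term."""
    idx = np.arange(len(actions))
    weights = target_probs[idx, actions] / behavior_probs[idx, actions]
    # Expected predicted reward under the target policy, per context.
    direct = np.sum(target_probs * reward_model, axis=1)
    # Correction for the reward model's error on the logged action.
    residual = rewards - reward_model[idx, actions]
    return float(np.mean(direct + weights * residual))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 1000, 3
    mu = np.full((n, k), 1.0 / k)           # uniform behavior policy (assumed known)
    pi = np.tile([0.8, 0.1, 0.1], (n, 1))   # hypothetical target policy
    a = rng.integers(0, k, size=n)          # logged actions drawn uniformly
    # Synthetic Bernoulli rewards; true target value is
    # 0.8 * 0.7 + 0.1 * 0.4 + 0.1 * 0.2 = 0.62.
    r = rng.binomial(1, np.array([0.7, 0.4, 0.2])[a]).astype(float)
    q_hat = np.tile([0.6, 0.5, 0.3], (n, 1))  # hypothetical, imperfect reward model
    print(ipw_value(r, a, mu, pi), dr_value(r, a, mu, pi, q_hat))
```

Both functions share the same importance weights, which is why DR inherits the weighting variance that the abstract says it does not directly address.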

Paper Structure (18 sections, 3 theorems, 39 equations, 1 figure, 6 tables, 2 algorithms)

Introduction
Off-policy Evaluation
Nonparametric Framework of Policy Evaluation
Nonparametric Model Framework
Equivalent representations for off-policy evaluation
Nonparametric model framework
Illustration of the two cases
Nonparametric Estimation
Nonparametric Weighting for Policy Evaluation
Robustness to Behavior Policy Estimation
Error Analysis
An Illustrative Example
Model-assisted Nonparametric Weighting
Error Analysis
An Illustrative Example
...and 3 more sections

Key Result

Proposition 3.1

Under the definition of $f^{\pi}(p_{ia})$ in Eqn. cond-e, the value $V^{\pi}$ of the target policy admits the following equivalent representations. (a) Design-based representation: (b) Model-based representation:

Figures (1)

Figure 1: Performance across various sample sizes for the Page dataset. Left: RMSE as a function of sample size; Right: Bias as a function of sample size.

Theorems & Definitions (4)

Proposition 3.1
proof
Proposition 3.2
Proposition 4.1
