Clustering Context in Off-Policy Evaluation

Daniel Guzman-Olivares, Philipp Schmidt, Jacek Golebiowski, Artur Bekasov

TL;DR

Off-policy evaluation degrades when the logging policy overlaps poorly with the evaluation policy. This work introduces CHIPS, an estimator that clusters contexts and pools data within each cluster to improve estimation in deficient information settings. The authors analyze its bias and variance under the Common Cluster Support and Reward Homogeneity assumptions and their relaxations ($\delta$-homogeneity), showing variance reduction relative to IPS and comparing against MIPS. Empirically, CHIPS improves estimation accuracy on synthetic problems and on the real-world Open Bandit Dataset, with MAP reward estimation offering robustness to reward misspecification. The results highlight trade-offs in cluster design and hyperparameter choice, and suggest future work on combining CHIPS with action-embedding methods and on automatic hyperparameter selection.

Abstract

Off-policy evaluation can leverage logged data to estimate the effectiveness of new policies in e-commerce, search engines, media streaming services, or automatic diagnostic tools in healthcare. However, the performance of baseline off-policy estimators like IPS deteriorates when the logging policy significantly differs from the evaluation policy. Recent work proposes sharing information across similar actions to mitigate this problem. In this work, we propose an alternative estimator that shares information across similar contexts using clustering. We study the theoretical properties of the proposed estimator, characterizing its bias and variance under different conditions. We also compare the performance of the proposed estimator and existing approaches in various synthetic problems, as well as a real-world recommendation dataset. Our experimental results confirm that clustering contexts improves estimation accuracy, especially in deficient information settings.
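To make the baseline concrete, the following is a minimal sketch of the standard IPS estimator, which averages rewards reweighted by the ratio of evaluation to logging propensities. It is not the paper's code: the toy bandit (one context, two actions), the reward means, the policies, and the helper names `simulate_logs`, `eval_policy`, and `spread` are all assumptions made for illustration. It only shows how the spread of IPS estimates grows when the logging policy rarely takes the action the evaluation policy prefers.

```python
import random
import statistics


def ips_estimate(logs, eval_policy):
    """Standard IPS: mean of w * r, with w = pi(a|x) / pi_0(a|x).

    logs: list of (context, action, reward, logging_propensity) tuples.
    eval_policy: function (context, action) -> probability under the
        evaluation policy.
    """
    total = 0.0
    for x, a, r, p0 in logs:
        total += eval_policy(x, a) / p0 * r
    return total / len(logs)


def simulate_logs(n, p_log_a1, rng):
    """Hypothetical toy bandit: a single context (0) and two actions.

    The reward means below are made up for illustration only.
    """
    mean_reward = {0: 0.2, 1: 0.8}
    logs = []
    for _ in range(n):
        a = 1 if rng.random() < p_log_a1 else 0
        r = 1.0 if rng.random() < mean_reward[a] else 0.0
        p0 = p_log_a1 if a == 1 else 1.0 - p_log_a1
        logs.append((0, a, r, p0))
    return logs


def eval_policy(x, a):
    """Hypothetical evaluation policy that strongly prefers action 1."""
    return 0.9 if a == 1 else 0.1


def spread(p_log_a1, n=200, runs=200, seed=0):
    """Standard deviation of the IPS estimate over repeated simulated logs."""
    rng = random.Random(seed)
    estimates = [
        ips_estimate(simulate_logs(n, p_log_a1, rng), eval_policy)
        for _ in range(runs)
    ]
    return statistics.pstdev(estimates)


if __name__ == "__main__":
    print("spread, logging picks action 1 half the time:", spread(0.5))
    print("spread, logging rarely picks action 1:       ", spread(0.05))
```

In this toy setup the second spread is noticeably larger than the first, because the few logged samples of the preferred action carry very large importance weights. This is the deterioration the abstract refers to when the logging and evaluation policies differ significantly.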

Paper Structure

This paper contains 32 sections, 10 theorems, 61 equations, 21 figures, 2 tables.

Key Result

Proposition 3.3

Given a policy $\pi$, if Assumptions as2 and as3 hold, then we have that [...]. Please refer to apdx:proof-p1 for a complete proof.

Figures (21)

  • Figure 1: From left to right, the mean square error in the synthetic dataset experiments varying the number of clusters, the distributional shift between logging and evaluation policy ($\beta$), and the number of deficient actions in the logging data (normalized w.r.t. IPS).
  • Figure 2: ECDF of the relative mean squared error with respect to IPS for the real dataset using 50000 (left), 100000 (center), and 500000 (right) logging samples.
  • Figure 3: From left to right, MSE, Bias, and Variance of the CHIPS estimator compared to baselines while varying the number of clusters.
  • Figure 4: From left to right, MSE, Bias, and Variance of the CHIPS estimator compared to baselines while varying the number of actions.
  • Figure 5: From left to right, MSE, Bias, and Variance of the CHIPS estimator compared to baselines while varying $\beta$ values.
  • ...and 16 more figures

Theorems & Definitions (10)

  • Proposition 3.3
  • Proposition 3.4
  • Proposition 3.6
  • Proposition 3.7
  • Proposition 3.8
  • Corollary 3.9
  • Proposition 3.10
  • Proposition 3.11
  • Proposition 3.12
  • Lemma A.1