Context-Action Embedding Learning for Off-Policy Evaluation in Contextual Bandits

Kushagra Chandak, Vincent Liu, Haanvid Lee

TL;DR

This work tackles the high variance of off-policy evaluation in contextual bandits with large action spaces by learning context-action embeddings that directly minimize the MSE of the Marginalized IPS (MIPS) estimator. The authors derive upper bounds on the bias and variance of MIPS and use them to form an MSE-minimizing objective for CAEL-MIPS that combines reward prediction with bias and variance terms, so that the learned embeddings are conditioned on both context and action. The method is validated on synthetic data and the real-world Open Bandit Dataset, where it reduces MSE and achieves a more favorable bias-variance trade-off than IPS, AEL-MIPS, and DM. The results indicate that context-action conditioned embeddings can significantly improve offline policy evaluation, with practical implications for rapid iteration in recommender systems and digital marketing.

Abstract

We consider off-policy evaluation (OPE) in contextual bandits with a finite action space. Inverse Propensity Score (IPS) weighting is a widely used method for OPE because it is unbiased, but it suffers from significant variance when the action space is large or when some parts of the context-action space are underexplored. Recently introduced Marginalized IPS (MIPS) estimators mitigate this issue by leveraging action embeddings. However, these embeddings do not minimize the mean squared error (MSE) of the estimators and do not consider context information. To address these limitations, we introduce Context-Action Embedding Learning for MIPS, or CAEL-MIPS, which learns context-action embeddings from offline data to minimize the MSE of the MIPS estimator. Building on a theoretical analysis of the bias and variance of MIPS, we present an MSE-minimizing objective for CAEL-MIPS. In empirical studies on a synthetic dataset and a real-world dataset, we demonstrate that our estimator outperforms baselines in terms of MSE.
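To make the contrast between IPS and MIPS concrete, the following is a minimal sketch, not the paper's implementation. It assumes a fixed, deterministic, discrete action embedding (each action maps to one embedding id), whereas CAEL-MIPS learns context-action embeddings. The function names `ips_estimate` and `mips_estimate`, the array layout, and the `action_to_embed` mapping are all illustrative assumptions. IPS reweights each reward by the ratio of action probabilities; MIPS reweights by the ratio of marginal embedding probabilities, obtained here by summing policy mass over the actions that share an embedding.

```python
import numpy as np

def ips_estimate(pi_target, pi_behavior, actions, rewards):
    """IPS: average of rewards weighted by pi(a|x) / pi_0(a|x).

    pi_target, pi_behavior: (n, n_actions) arrays of action probabilities.
    actions: (n,) logged action indices; rewards: (n,) logged rewards.
    """
    idx = np.arange(len(actions))
    weights = pi_target[idx, actions] / pi_behavior[idx, actions]
    return float(np.mean(weights * rewards))

def mips_estimate(pi_target, pi_behavior, actions, rewards,
                  action_to_embed, n_embed):
    """MIPS with a deterministic discrete embedding (illustrative assumption).

    action_to_embed: (n_actions,) embedding id of each action.
    The marginal embedding probability under a policy is the total
    probability of all actions mapped to that embedding id.
    """
    onehot = np.eye(n_embed)[action_to_embed]      # (n_actions, n_embed)
    p_e_target = pi_target @ onehot                # (n, n_embed)
    p_e_behavior = pi_behavior @ onehot            # (n, n_embed)
    idx = np.arange(len(actions))
    logged_embed = action_to_embed[actions]        # embedding of logged action
    weights = p_e_target[idx, logged_embed] / p_e_behavior[idx, logged_embed]
    return float(np.mean(weights * rewards))
```

In this sketch, grouping several actions under one embedding id makes the weights less extreme than the per-action IPS weights, which is the source of the variance reduction; the price is possible bias when the embedding does not capture everything about the action that affects the reward. With one embedding id per action, the two estimators coincide.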

Paper Structure

This paper contains 21 sections, 4 theorems, 33 equations, 3 figures, 1 algorithm.

Key Result

Proposition 1

Under Assumptions [as:common-embed-supp] and [as:bounded-mean-var], an upper bound on the MIPS estimator's bias is given by

Figures (3)

  • Figure 1: Bias, variance, and MSE comparison between CAEL-MIPS, AEL-MIPS, IPS, and DM (AEL-MIPS refers to the method of Cief et al., 2024). The vertical axis is plotted on a log scale. The values are estimated using 1000 samples and averaged over 30 independent runs.
  • Figure 2: Empirical cumulative distribution function (CDF) of relative errors of different estimators for the real-world bandit dataset. The relative errors are computed by dividing the squared errors of all the estimators by the squared error of IPS.
  • Figure 3: MSE, bias, and variance of IPS, DM, AEL-MIPS, and CAEL-MIPS as a function of various problem parameters for the synthetic dataset. In all plots, the vertical axis is on a log scale and the shaded regions represent 95% confidence intervals. The first group of panels shows the MSE, bias, and variance as the number of samples in the dataset increases. The second group shows the MSE, bias, and variance as the number of actions increases. The third group shows the MSE of all the estimators as a function of various environment parameters, where the softmax parameter is the inverse temperature of the softmax behavior policy and Epsilon is the suboptimality of the target policy. The trend for MIPS-based estimators is similar to that of IPS, but with a larger reduction in MSE due to variance reduction. Furthermore, CAEL-MIPS has the lowest MSE across the different environment settings, since it learns embeddings that balance the bias-variance trade-off.
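
The relative-error CDF described in the Figure 2 caption can be sketched as follows. This is an illustrative reconstruction from the caption only; the helper names `relative_errors` and `ecdf` are assumptions, not the authors' code. Each estimator's squared error is divided by the squared error of IPS on the same evaluation, and the empirical CDF is then the fraction of evaluations whose relative error falls at or below each value.

```python
import numpy as np

def relative_errors(squared_errors, squared_errors_ips):
    """Squared errors of an estimator divided by those of IPS, elementwise."""
    return (np.asarray(squared_errors, dtype=float)
            / np.asarray(squared_errors_ips, dtype=float))

def ecdf(values):
    """Empirical CDF: sorted values and the fraction of points <= each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y
```

Under this convention, a relative error below 1 means the estimator beat IPS on that evaluation, so a CDF curve that rises further to the left indicates an estimator that outperforms IPS more often.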

Theorems & Definitions (8)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Proof of Proposition [prop:entropy-bias]
  • Proof of Proposition [prop:entropy-var]
  • Proof of Proposition [prop:bias-bound]
  • Proof of Proposition [prop:var-bound]