Context-Action Embedding Learning for Off-Policy Evaluation in Contextual Bandits
Kushagra Chandak, Vincent Liu, Haanvid Lee
TL;DR
This work tackles the high variance of off-policy evaluation in contextual bandits with large action spaces by learning context-action embeddings that directly minimize the MSE of the Marginalized IPS (MIPS) estimator. The authors derive upper bounds on the bias and variance of MIPS and use them to form an MSE-minimizing objective for CAEL-MIPS that blends reward prediction with bias and variance terms, yielding embeddings conditioned on both context and action. The method is validated on synthetic data and the real-world Open Bandit Dataset, showing reductions in MSE and favorable bias-variance trade-offs compared with IPS, AEL-MIPS, and DM. The results indicate that embeddings conditioned on both context and action can significantly improve offline policy evaluation, with practical implications for rapid iteration in recommender systems and digital marketing.
Abstract
We consider off-policy evaluation (OPE) in contextual bandits with a finite action space. Inverse Propensity Score (IPS) weighting is a widely used method for OPE due to its unbiasedness, but it suffers from significant variance when the action space is large or when some parts of the context-action space are underexplored. Recently introduced Marginalized IPS (MIPS) estimators mitigate this issue by leveraging action embeddings. However, these embeddings do not minimize the mean squared error (MSE) of the estimators and do not consider context information. To address these limitations, we introduce Context-Action Embedding Learning for MIPS, or CAEL-MIPS, which learns context-action embeddings from offline data to minimize the MSE of the MIPS estimator. Building on the theoretical analysis of bias and variance of MIPS, we present an MSE-minimizing objective for CAEL-MIPS. In the empirical studies on a synthetic dataset and a real-world dataset, we demonstrate that our estimator outperforms baselines in terms of MSE.
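To make the contrast between IPS and MIPS concrete, the following is a minimal sketch in Python, not the CAEL-MIPS method itself. It assumes a fixed, deterministic map from actions to discrete embedding ids (the hypothetical array `action_to_emb`), so the marginal probability of an embedding under a policy is the sum of that policy's probabilities over the actions sharing the embedding. All function and parameter names here are illustrative assumptions.

```python
import numpy as np


def ips_estimate(rewards, actions, pi_e, pi_b):
    """Vanilla IPS: mean of r_i * pi_e(a_i | x_i) / pi_b(a_i | x_i).

    pi_e, pi_b: arrays of shape (n, n_actions) with the target and logging
    policy probabilities for each logged context.
    """
    idx = np.arange(len(actions))
    weights = pi_e[idx, actions] / pi_b[idx, actions]
    return float(np.mean(weights * rewards))


def marginal_embedding_probs(pi, action_to_emb, n_emb):
    """p(e | x, pi) = sum of pi(a | x) over actions a mapped to embedding e.

    Assumption: each action maps deterministically to one embedding id.
    """
    probs = np.zeros((pi.shape[0], n_emb))
    for e in range(n_emb):
        probs[:, e] = pi[:, action_to_emb == e].sum(axis=1)
    return probs


def mips_estimate(rewards, actions, pi_e, pi_b, action_to_emb):
    """MIPS under the assumption above: weights are ratios of marginal
    embedding probabilities rather than ratios of action probabilities."""
    n_emb = int(action_to_emb.max()) + 1
    p_e = marginal_embedding_probs(pi_e, action_to_emb, n_emb)
    p_b = marginal_embedding_probs(pi_b, action_to_emb, n_emb)
    idx = np.arange(len(actions))
    logged_emb = action_to_emb[actions]
    weights = p_e[idx, logged_emb] / p_b[idx, logged_emb]
    return float(np.mean(weights * rewards))
```

Because many actions can share one embedding, the marginal weights in this sketch are typically less extreme than the per-action IPS weights, which is the source of the variance reduction. When every action has its own embedding id, the two estimators coincide.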
