
Learning Action Embeddings for Off-Policy Evaluation

Matej Cief, Jacek Golebiowski, Philipp Schmidt, Ziawasch Abedjan, Artur Bekasov

TL;DR

This work uses intermediate outputs of a trained reward model to define action embeddings for MIPS. In experiments, it improves upon MIPS with pre-defined embeddings, as well as standard baselines, on both synthetic and real-world data.

Abstract

Off-policy evaluation (OPE) methods allow us to estimate the expected reward of a policy by using the logged data collected by a different policy. OPE is a viable alternative to running expensive online A/B tests: it can speed up the development of new policies, and reduces the risk of exposing customers to suboptimal treatments. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have high or even infinite variance. Saito and Joachims (arXiv:2202.06317v2 [cs.LG]) propose marginalized IPS (MIPS), which uses action embeddings instead and thereby reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, on both synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to doubly robust (DR) estimation for combining the low variance of the direct method (DM) with the low bias of IPS.
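To make the contrast between IPS and MIPS concrete, here is a minimal sketch of the two estimators. It assumes a simplified setting that the abstract does not specify: each action maps deterministically to one discrete embedding, so the marginal embedding probability under a policy is the sum of the probabilities of all actions sharing that embedding. The function names, argument layout, and toy numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def ips_estimate(pi_e, pi_0, actions, rewards):
    """IPS: mean of rewards weighted by pi_e(a|x) / pi_0(a|x).

    pi_e, pi_0: arrays of shape (n, n_actions) with per-sample action
    probabilities under the evaluated and the logging policy.
    """
    idx = np.arange(len(actions))
    w = pi_e[idx, actions] / pi_0[idx, actions]
    return float(np.mean(w * rewards))

def mips_estimate(pi_e, pi_0, actions, rewards, action_to_emb, n_emb):
    """MIPS with a deterministic embedding e = action_to_emb[a] (assumption).

    Weights are ratios of marginal embedding probabilities
    p(e|x, pi_e) / p(e|x, pi_0) instead of action probabilities.
    """
    onehot = np.eye(n_emb)[action_to_emb]   # (n_actions, n_emb)
    p_e = pi_e @ onehot                     # marginal over actions, (n, n_emb)
    p_0 = pi_0 @ onehot
    idx = np.arange(len(actions))
    e = action_to_emb[actions]              # embedding of each logged action
    w = p_e[idx, e] / p_0[idx, e]
    return float(np.mean(w * rewards))

# Toy example: 4 actions grouped into 2 embeddings, uniform logging policy.
action_to_emb = np.array([0, 0, 1, 1])
pi_0 = np.full((2, 4), 0.25)
pi_e = np.tile(np.array([0.7, 0.1, 0.1, 0.1]), (2, 1))
actions = np.array([0, 2])
rewards = np.array([1.0, 1.0])

v_ips = ips_estimate(pi_e, pi_0, actions, rewards)
v_mips = mips_estimate(pi_e, pi_0, actions, rewards, action_to_emb, n_emb=2)
```

In this toy example the largest IPS weight is 0.7 / 0.25 = 2.8, while the largest MIPS weight is 0.8 / 0.5 = 1.6. Smaller weights are the mechanism behind the variance reduction the abstract describes. When every action has its own embedding, the two estimators coincide.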

Paper Structure

This paper contains 12 sections, 12 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: The model we use to learn the action embeddings. We use a simple linear model in this work, but any model class can be used. Input $a$ can be a one-hot encoded representation of the action identity, a pre-defined action embedding, or a concatenated vector of both.
  • Figure 2: Synthetic experiments varying the number of actions and training samples. To better distinguish comparable methods, those marked with $\triangledown$ only use action identities, and those marked with $\times$ also use pre-defined action embeddings. Learned MIPS OneHot outperforms all standard baselines. When we have enough data for every action, it performs just as well as IPS. As the variance grows with fewer samples per action, its error approaches that of DM using the same model. The pre-defined embeddings ($d_\mathcal{E} = 3$) have low bias and variance; hence our methods cannot improve upon them. Shaded areas around the lines are standard errors (almost invisible).
  • Figure 3: Synthetic experiments varying the number of unobserved dimensions. As we progressively hide more dimensions, pre-defined embeddings get more biased and methods using them get less accurate. Combining Learned MIPS OneHot and FineTune yields the most robust results when the bias of pre-defined embeddings is unknown. Shaded areas around the lines are standard errors (almost invisible).
  • Figure 4: CDF of relative MSEs w.r.t. IPS on the real-world dataset. The intersection of a method's curve with the IPS curve tells us the proportion of experiments in which the method performs better than IPS.
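The Figure 1 caption describes a simple linear model whose intermediate output serves as the action embedding. The sketch below shows one plausible form of such a model, not necessarily the authors' exact architecture: a one-hot action is mapped linearly to an embedding, the context is mapped linearly to the same dimension, and the predicted reward is their dot product. The function name, hyperparameters, plain gradient-descent training loop, and synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_action_embeddings(X, A, R, n_actions, d_e=2, lr=0.02, epochs=500, seed=0):
    """Fit r_hat(x, a) = (x @ W_x) . W_a[a] by gradient descent on squared error.

    With a one-hot action input, the linear action map reduces to a row
    lookup, so row a of W_a is the learned embedding of action a.
    """
    rng = np.random.default_rng(seed)
    n, d_x = X.shape
    W_x = rng.normal(scale=0.3, size=(d_x, d_e))
    W_a = rng.normal(scale=0.3, size=(n_actions, d_e))
    losses = []
    for _ in range(epochs):
        z_x = X @ W_x                        # context projection, (n, d_e)
        z_a = W_a[A]                         # embedding of each logged action
        err = (z_x * z_a).sum(axis=1) - R    # prediction error per sample
        losses.append(float(np.mean(err ** 2)))
        g = 2.0 * err[:, None] / n           # d(loss)/d(prediction), broadcast
        grad_W_x = X.T @ (g * z_a)
        grad_W_a = np.zeros_like(W_a)
        np.add.at(grad_W_a, A, g * z_x)      # accumulate per-action gradients
        W_x -= lr * grad_W_x
        W_a -= lr * grad_W_a
    return W_x, W_a, losses

# Synthetic logged data (hypothetical): rewards follow a rank-2 bilinear model.
rng = np.random.default_rng(1)
n, d_x, n_actions = 2000, 5, 10
X = rng.normal(size=(n, d_x))
A = rng.integers(0, n_actions, size=n)
true_x = 0.5 * rng.normal(size=(d_x, 2))
true_a = rng.normal(size=(n_actions, 2))
R = ((X @ true_x) * true_a[A]).sum(axis=1) + 0.1 * rng.normal(size=n)

W_x, W_a, losses = fit_action_embeddings(X, A, R, n_actions)
embeddings = W_a                             # one learned embedding per action
```

As the caption notes, the action input could instead be a pre-defined embedding or a concatenation of both, in which case the row lookup `W_a[A]` would become a matrix product with the action feature vectors. How the resulting continuous embeddings enter the MIPS weights is not covered by the text above, so the sketch stops at extracting them.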