
A Simpler Alternative to Variational Regularized Counterfactual Risk Minimization

Hua Chang Bakker, Shashank Gupta, Harrie Oosterhuis

TL;DR

This work revisits the original experimental setting of VRCRM and proposes a novel, simpler alternative for $f$-divergence optimization: minimizing a direct approximation of the $f$-divergence instead of an $f$-GAN-based lower bound.

Abstract

Variance regularized counterfactual risk minimization (VRCRM) has been proposed as an alternative off-policy learning (OPL) method. The VRCRM method uses a lower bound on the $f$-divergence between the logging policy and the target policy as regularization during learning, and was shown to improve performance over existing OPL alternatives on multi-label classification tasks. In this work, we revisit the original experimental setting of VRCRM and propose to minimize the $f$-divergence directly, instead of optimizing the lower bound using an $f$-GAN approach. Surprisingly, we were unable to reproduce the results reported in the original setting. In response, we propose a novel, simpler alternative for $f$-divergence optimization that minimizes a direct approximation of the $f$-divergence instead of an $f$-GAN-based lower bound. Experiments showed that minimizing the divergence using $f$-GANs did not work as expected, whereas our proposed simpler alternative works better empirically.
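The abstract does not spell out the form of the direct approximation. As a rough illustration only, the sketch below estimates an $f$-divergence from logged data using the standard definition $D_f(\pi \,\|\, \pi_0) = \mathbb{E}_{a \sim \pi_0}\left[f\left(\pi(a)/\pi_0(a)\right)\right]$ and a plain sample average. The direction of the divergence, the choice of generator $f(t) = (t-1)^2$, the toy three-action policies, and all function names are assumptions made for this example, not details taken from the paper.

```python
import random


def f_chi2(t):
    """Generator f(t) = (t - 1)^2 of the chi-square divergence; f(1) = 0."""
    return (t - 1.0) ** 2


def estimate_f_divergence(target_probs, logging_probs, f):
    """Sample-average estimate of D_f(pi || pi0) = E_{a ~ pi0}[f(pi(a) / pi0(a))].

    target_probs[i] and logging_probs[i] are the probabilities that the target
    and logging policies assign to the i-th logged action (hypothetical inputs).
    """
    values = [f(p / q) for p, q in zip(target_probs, logging_probs)]
    return sum(values) / len(values)


def exact_f_divergence(pi, pi0, f):
    """Exact D_f(pi || pi0) for small discrete action spaces, for comparison."""
    return sum(q * f(p / q) for p, q in zip(pi, pi0))


def toy_comparison(n_samples=20000, seed=0):
    """Draw actions from a toy logging policy and compare estimate with exact value."""
    pi0 = [0.5, 0.3, 0.2]  # toy logging policy (assumption)
    pi = [0.2, 0.3, 0.5]   # toy target policy (assumption)
    rng = random.Random(seed)
    actions = rng.choices(range(len(pi0)), weights=pi0, k=n_samples)
    estimate = estimate_f_divergence(
        [pi[a] for a in actions], [pi0[a] for a in actions], f_chi2
    )
    return estimate, exact_f_divergence(pi, pi0, f_chi2)
```

Calling `toy_comparison()` returns the sample-average estimate alongside the exact value, which is useful as a sanity check. Unlike an $f$-GAN lower bound, an estimate of this kind needs no auxiliary discriminator network, but it does require the logging propensities to be known.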

Paper Structure

This paper contains 11 sections, 6 equations, and 2 figures.

Figures (2)

  • Figure 1: Performance of the methods for divergence minimization. Values closer to the logging policy are better. Scores significantly different from the logging policy scores are indicated with filled markers. Shaded areas indicate standard deviations.
  • Figure 2: Performance of IPS-based methods on synt-25-15. Scores significantly different from the IPS scores are indicated with filled markers. Shaded areas indicate standard deviations.