Off-policy Evaluation for Payments at Adyen
Alex Egg
TL;DR
This work demonstrates that Off-Policy Evaluation (OPE) can accelerate recommender-system optimization in a high-volume payments context by using historical data to evaluate new policies without costly online experiments. The authors implement and benchmark four OPE estimators, the Direct Method (DM), Inverse Propensity Scoring (IPS), Self-Normalized IPS (SNIPS), and Doubly Robust (DR), in a Spark-based pipeline. They find that IPS and SNIPS correlate strongly with online A/B test results (often >0.8), while DM and DR underperform in this setting. Their results suggest that OPE can identify winning variants and estimate incremental transaction impact (9–54 million transactions over six months), serving as a pre-screening tool that reduces the number of A/B testing cycles. The study highlights practical considerations such as the use of exploration traffic, variance control, and large-scale infrastructure, and argues that OPE should complement rather than replace A/B testing in industrial deployments. Overall, the paper provides actionable guidance for deploying OPE in large-scale financial recommender systems and outlines directions for future integration of end-to-end off-policy learning.
Abstract
This paper demonstrates the successful application of Off-Policy Evaluation (OPE) to accelerate recommender system development and optimization at Adyen, a global leader in financial payment processing. Facing the limitations of traditional A/B testing, which proved slow, costly, and often inconclusive, we integrated OPE to enable rapid evaluation of new recommender system variants using historical data. Our analysis, conducted on a billion-scale dataset of transactions, reveals a strong correlation between OPE estimates and online A/B test results, projecting an incremental 9–54 million transactions over a six-month period. We explore the practical challenges and trade-offs associated with deploying OPE in a high-volume production environment, including leveraging exploration traffic for data collection, mitigating variance in importance sampling, and ensuring scalability through the use of Apache Spark. By benchmarking various OPE estimators, we provide guidance on their effectiveness and on integrating them into decision-making for large-scale industrial payment systems.
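
To make the importance-sampling estimators mentioned above concrete, the following is a minimal sketch of the textbook IPS and SNIPS definitions in plain Python. It is not the Spark pipeline described in the paper; the function names, argument names, and the toy log are assumptions chosen for illustration only. Each logged record is assumed to carry a reward, the probability the logging policy assigned to the logged action, and the probability the candidate policy would assign to that same action.

```python
from typing import Sequence


def importance_weights(logging_probs: Sequence[float],
                       target_probs: Sequence[float]) -> list:
    """Per-record ratio pi_target(a|x) / pi_logging(a|x)."""
    if len(logging_probs) != len(target_probs):
        raise ValueError("probability sequences must have equal length")
    if any(p <= 0.0 for p in logging_probs):
        # IPS needs the logging policy to have explored every logged action.
        raise ValueError("logging probabilities must be strictly positive")
    return [t / l for t, l in zip(target_probs, logging_probs)]


def ips_estimate(rewards: Sequence[float],
                 logging_probs: Sequence[float],
                 target_probs: Sequence[float]) -> float:
    """IPS: average of importance-weighted rewards (divide by n)."""
    weights = importance_weights(logging_probs, target_probs)
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def snips_estimate(rewards: Sequence[float],
                   logging_probs: Sequence[float],
                   target_probs: Sequence[float]) -> float:
    """SNIPS: same numerator as IPS, normalized by the sum of weights."""
    weights = importance_weights(logging_probs, target_probs)
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


# Hypothetical four-record log with binary rewards (e.g. payment authorized).
rewards = [1, 0, 1, 1]
logging_probs = [0.5, 0.5, 0.25, 0.25]
target_probs = [0.25, 0.75, 0.5, 0.5]

ips = ips_estimate(rewards, logging_probs, target_probs)
snips = snips_estimate(rewards, logging_probs, target_probs)
print(ips, snips)
```

On this toy log the IPS estimate is larger than 1 even though every reward is 0 or 1, while the SNIPS estimate stays within the reward range. That is one intuition for why self-normalization is commonly used to control the variance of importance sampling.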
