Evaluating Model-Free Policy Optimization in Masked-Action Environments via an Exact Blackjack Oracle

Kevin Song

Abstract

Infinite-shoe casino blackjack provides a rigorous, exactly verifiable benchmark for discrete stochastic control under dynamically masked actions. Under a fixed Vegas-style ruleset (S17, 3:2 payout, dealer peek, double on any two, double after split, resplit to four), an exact dynamic programming (DP) oracle was derived over 4,600 canonical decision cells. This oracle yielded ground-truth action values, optimal policy labels, and a theoretical expected value (EV) of $-0.00161$ per hand. To evaluate sample-efficient policy recovery, three model-free optimizers were trained via simulated interaction: masked REINFORCE with a per-cell exponential moving average baseline, simultaneous perturbation stochastic approximation (SPSA), and the cross-entropy method (CEM). REINFORCE was the most sample-efficient, achieving a 46.37% action-match rate and an EV of $-0.04688$ after $10^6$ hands, outperforming CEM (39.46%, $7.5 \times 10^6$ evaluations) and SPSA (38.63%, $4.8 \times 10^6$ evaluations). However, all methods exhibited substantial cell-conditional regret, indicating persistent policy-level errors despite smooth reward convergence. This gap shows that tabular environments with severe state-visitation sparsity and dynamic action masking remain challenging, while aggregate reward curves can obscure critical local failures. As a negative control, it was proven and empirically confirmed that under i.i.d. draws without counting, optimal bet sizing collapses to the table minimum. In addition, larger wagers strictly increased volatility and ruin without improving expectation. These results highlight the need for exact oracles and negative controls to avoid mistaking stochastic variability for genuine algorithmic performance.
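
To make the phrase "masked REINFORCE with a per-cell exponential moving average baseline" concrete, the following is a minimal sketch of one way such an update could look. It is not the paper's implementation: the function names (`masked_softmax`, `reinforce_step`, `train_toy`), the learning rate `lr`, the baseline decay `beta`, and the single-cell toy environment with made-up win probabilities are all assumptions chosen for illustration.

```python
import math
import random


def masked_softmax(logits, mask):
    """Softmax restricted to legal actions; illegal actions get probability 0.

    Assumes at least one action is legal.
    """
    top = max(l for l, legal in zip(logits, mask) if legal)
    exps = [math.exp(l - top) if legal else 0.0 for l, legal in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(theta, baseline, cell, mask, action, reward, lr=0.05, beta=0.05):
    """One masked REINFORCE update for a single visited decision cell."""
    probs = masked_softmax(theta[cell], mask)
    advantage = reward - baseline[cell]  # per-cell baseline reduces variance
    for a, legal in enumerate(mask):
        if legal:
            # Gradient of log pi(action | cell) with respect to logit a.
            grad = (1.0 if a == action else 0.0) - probs[a]
            theta[cell][a] += lr * advantage * grad
    # Exponential moving average of rewards observed in this cell.
    baseline[cell] += beta * (reward - baseline[cell])


def train_toy(steps=5000, seed=0):
    """Hypothetical one-cell environment: three actions, the third is masked."""
    rng = random.Random(seed)
    theta = {"cell": [0.0, 0.0, 0.0]}
    baseline = {"cell": 0.0}
    mask = [True, True, False]
    win_prob = [0.25, 0.60, 0.99]  # illustrative only; action 2 is never legal
    legal_actions = [a for a, legal in enumerate(mask) if legal]
    for _ in range(steps):
        probs = masked_softmax(theta["cell"], mask)
        action = rng.choices(legal_actions, weights=[probs[a] for a in legal_actions])[0]
        reward = 1.0 if rng.random() < win_prob[action] else -1.0
        reinforce_step(theta, baseline, "cell", mask, action, reward)
    return masked_softmax(theta["cell"], mask)


if __name__ == "__main__":
    print(train_toy())
```

In this sketch the mask is applied before normalization, so an illegal action can neither be sampled nor receive gradient, however attractive its payoff would be. In a full blackjack setting the dictionary keys would be decision cells and the mask would change from cell to cell.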

Paper Structure

This paper contains 16 sections, 1 theorem, 6 equations, 8 figures, 5 tables.

Key Result

Theorem 1

Assume an infinite-shoe model lacking card-counting mechanics. Let $e < 0$ be the expected return per unit wager under optimal policy $\pi^*$. For any adaptive sequence of wagers $\{b_t\}_{t=1}^N$ satisfying $b_t \geq b_\mathrm{min} > 0$, expected total profit is strictly maximized (absolute loss mi

Figures (8)

  • Figure 1: Smoothed EV per hand as a function of hands played. Policy gradient (REINFORCE) rapidly improved, representing the only optimizer to cross the 95% and 99% gap-closing thresholds relative to the oracle EV ($-0.00161$).
  • Figure 2: Empirical bet-size sweep evaluating expected return. The optimizer correctly identified the minimum legal bet as mathematically optimal.
  • Figure S1: SPSA regret heatmap for hard totals.
  • Figure S2: SPSA regret heatmap for soft totals.
  • Figure S3: SPSA regret heatmap for pair cells.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Theorem 1: No-count minimum-bet optimality
  • proof
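
The intuition behind Theorem 1 can be sketched in a few lines. This is a reconstruction under stated assumptions, not the paper's proof: it assumes each wager $b_t$ is fixed before hand $t$ is dealt and that hands are i.i.d. with per-unit return $X_t$, $\mathbb{E}[X_t] = e$. The symbols $X_t$ and $\mathcal{F}_{t-1}$ (the history before hand $t$) are introduced here for illustration.

```latex
\begin{align*}
\mathbb{E}\Big[\sum_{t=1}^{N} b_t X_t\Big]
  &= \sum_{t=1}^{N} \mathbb{E}\big[\, b_t \, \mathbb{E}[X_t \mid \mathcal{F}_{t-1}] \,\big]
   = e \sum_{t=1}^{N} \mathbb{E}[b_t] \\
  &\le e \, N \, b_{\mathrm{min}}
  \qquad \text{since } e < 0 \text{ and } \mathbb{E}[b_t] \ge b_{\mathrm{min}} .
\end{align*}
```

Equality holds only when $b_t = b_{\mathrm{min}}$ almost surely for every $t$. As a worked instance with the oracle value $e = -0.00161$: over $N = 10^6$ hands at a unit minimum bet the expected total is $-1610$ units, and any wager above the minimum on any hand makes that total strictly more negative.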