Learning to Bet for Horizon-Aware Anytime-Valid Testing

Ege Onur Taga; Samet Oymak; Shubhanshu Shekhar

Abstract

We develop horizon-aware anytime-valid tests and confidence sequences for bounded means under a strict deadline $N$. Using the betting/e-process framework, we cast horizon-aware betting as a finite-horizon optimal control problem with state space $(t, \log W_t)$, where $t$ is the time and $W_t$ is the value of the test martingale. We first show that in certain interior regions of the state space, policies that deviate significantly from Kelly betting are provably suboptimal, while Kelly betting reaches the threshold with high probability. We then identify sufficient conditions under which, outside this region, betting more aggressively than Kelly can be better if the bettor is behind schedule, and betting less aggressively can be better if the bettor is ahead. Taken together, these results suggest a simple phase diagram in the $(t, \log W_t)$ plane, delineating regions where Kelly, fractional Kelly, and aggressive betting may be preferable. Guided by this phase diagram, we introduce a deep reinforcement learning approach based on a universal Deep Q-Network (DQN) agent that learns a single policy from synthetic experience and maps simple statistics of past observations to bets across horizons and null values. In limited-horizon experiments, the learned DQN policy yields state-of-the-art results.
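To make the state $(t, \log W_t)$ concrete, the following is a minimal sketch of a deadline-limited betting test. It assumes the standard testing-by-betting wealth update $W_t = W_{t-1}\,(1 + \lambda_t (X_t - m))$ for observations in $[0,1]$, which the abstract does not spell out. The function name `run_betting_test`, the `bet_fn` interface, and the constant bet in the demo are hypothetical illustrations, not the paper's DQN policy.

```python
import numpy as np

def run_betting_test(x, m, alpha, bet_fn):
    """Run a betting test of H0: mean = m on observations x in [0, 1].

    Assumes 0 < m < 1 and the standard wealth update
    W_t = W_{t-1} * (1 + lam_t * (x_t - m)).
    bet_fn(t, log_w, n) is a hypothetical policy interface: it sees the
    current step, the current log-wealth, and the deadline n = len(x).
    Returns the first step at which log W_t >= log(1/alpha), else None.
    """
    n = len(x)
    threshold = np.log(1.0 / alpha)
    # Bets are clipped strictly inside [-1/(1-m), 1/m] so wealth stays positive.
    lam_lo, lam_hi = -0.999 / (1.0 - m), 0.999 / m
    log_w = 0.0
    for t, x_t in enumerate(x, start=1):
        lam = float(np.clip(bet_fn(t, log_w, n), lam_lo, lam_hi))
        log_w += np.log1p(lam * (x_t - m))
        if log_w >= threshold:
            return t  # rejected before or at the deadline
    return None  # deadline reached without rejection

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m, alpha = 100, 0.45, 0.05
    # Two-component Beta mixture with mean 0.40, as in the figure captions.
    pick = rng.random(n) < 0.5
    x = np.where(pick, rng.beta(2.4, 3.6, n), rng.beta(4.8, 7.2, n))
    # Hypothetical constant bet against the null (true mean is below m).
    print(run_betting_test(x, m, alpha, lambda t, log_w, n: -0.5))
```

A horizon-aware policy would replace the constant bet with a function of the remaining time `n - t` and the current `log_w`, which is exactly the information the sketch passes to `bet_fn`.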

Paper Structure

This paper contains 26 sections, 4 theorems, 77 equations, 9 figures, and 1 table.

Introduction
Background
Horizon-Aware Testing by Betting
Horizon-Aware Confidence Sequences
From Phase Diagram to Policies: Heuristics and Deep Q-Learning
Phase Diagram of Optimal Actions
$\epsilon$-Greedy Schedules
Deep-Q-Learning
Numerical Results
Conclusion
Additional Background
Deferred Proofs
Proof of Theorem \ref{theorem:kelly-near-optimal}
Proof of Proposition \ref{prop:aggressive-betting}
Proof of Proposition \ref{prop:defensive-betting}
...and 11 more sections

Key Result

Theorem 3.1

Consider any time $t$, and log-wealth $y = \log W_t$. Let $b = \log(1/\alpha)$ denote the threshold, and $T = N-t$ the remaining time. Fix a $\delta > 0$, and assume that there exists an $\epsilon \equiv \epsilon(\delta)>0$, such that $|\lambda - \lambda^{\mathrm{Kelly}}_m| \geq \delta$ implies $L(

Figures (9)

Figure 1: A representative phase diagram in the $(t,\log W_t)$ plane, illustrating the qualitative partition of the plane into regimes where Kelly, aggressive, and conservative betting may be preferred under a finite horizon $N$ and significance level $\alpha$. Formal results implying such a partition are presented in Section \ref{sec:testing}.
Figure 2: The phase diagram of optimal actions in the $(t, \log W_t)$ plane, showing how the optimal policy changes with problem difficulty. The $X_i$ are drawn from a two-component Beta mixture: $X_i \sim \mathrm{Beta}(2.4,3.6)$ with probability $1/2$ and $X_i \sim \mathrm{Beta}(4.8,7.2)$ with probability $1/2$, which has mean $\mu_X=0.40$ and varying variance across realizations. Experimental details are in Section \ref{subsec:optimal-actions}.
Figure 3: We compare the probability of rejection by time $t$ with $N=100$, $\alpha=0.05$, and null mean $m=0.45$. We simulate 5,000 length-$N$ sequences $\{X_i\}$ from a two-component Beta mixture. We define the hopeless region as the set of states where the optimal policy attains $\mathbb{P}(\text{reject by } N) < 10^{-4}$. In each of (a) and (b), the right panel shows the DQN's most frequently selected action, illustrating that the learned phase diagram adapts across data-generating distributions. The Linear-$\epsilon$ baseline is implemented with $\eta=1/2$, and Hedge over $\epsilon$ is implemented with varying $\eta$. All-in refers to $\lambda_{\max}$. The details are provided in Section \ref{subsec:eps-greedy}. (a): $X_i \sim \mathrm{Beta}(2.4,3.6)$ with probability $1/2$ and $X_i \sim \mathrm{Beta}(4.8,7.2)$ with probability $1/2$. (b): $X_i \sim \mathrm{Beta}(0.4,0.6)$ with probability $1/2$ and $X_i \sim \mathrm{Beta}(0.8,1.2)$ with probability $1/2$.
Figure 4: We compare confidence sequences under different data-generating distributions with $\alpha =0.05$, following Section \ref{sec:horizon-aware-CSs}. DQN produces the narrowest confidence intervals at the deadline $t=N=100$ and also yields tight intervals in the early steps. (a): $X_i \sim \mathrm{Beta}(2,6)$ with probability $1/2$ and $X_i \sim \mathrm{Beta}(4,12)$ with probability $1/2$. (b): $X_i \sim \mathrm{Beta}(0.4,0.6)$ with probability $1/2$ and $X_i \sim \mathrm{Beta}(0.8,1.2)$ with probability $1/2$. (c): $X_i \sim \mathrm{Beta}(6.5,3.5)$ with probability $1/2$ and $X_i \sim \mathrm{Beta}(13,7)$ with probability $1/2$.
Figure 5: We plot the lower confidence bounds across different Beta distributions and deadlines $N$. When a method's curve passes through a point $(x,y)$, it means that over 5000 repetitions, the lower confidence bound was below $x$ in a fraction $y$ of the runs. The vertical line indicates the mean of the distribution, and the horizontal line indicates the $(1-\alpha/2)$ quantile. Lower curves indicate better performance across methods. Note that in all cases, at $x=\mu_X$ we observe $y>1-\alpha/2$, consistent with the validity of the methods.
...and 4 more figures
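
The Beta mixtures named in the captions above pair two components with the same mean but different variances. As a quick check under that reading, the sketch below computes the mean and variance of such a mixture from the closed-form Beta moments; the helper names `beta_mean_var` and `mixture_mean_var` are hypothetical and not taken from the paper.

```python
def beta_mean_var(a, b):
    """Closed-form mean and variance of a Beta(a, b) distribution."""
    s = a + b
    return a / s, a * b / (s ** 2 * (s + 1))

def mixture_mean_var(components, weights):
    """Mean and variance of a finite mixture of Beta distributions.

    components: list of (a, b) shape parameters.
    weights: mixing probabilities, assumed to sum to 1.
    """
    moments = [beta_mean_var(a, b) for a, b in components]
    mean = sum(w * mu for w, (mu, _) in zip(weights, moments))
    # Second moment of each component is var + mean^2.
    second = sum(w * (var + mu ** 2) for w, (mu, var) in zip(weights, moments))
    return mean, second - mean ** 2

if __name__ == "__main__":
    # Mixture used in the captions of Figures 2 and 3(a).
    print(mixture_mean_var([(2.4, 3.6), (4.8, 7.2)], [0.5, 0.5]))
    # Component variances differ even though both means equal 0.40.
    print(beta_mean_var(2.4, 3.6), beta_mean_var(4.8, 7.2))
```

Because both components share the mean $0.40$, the mixture mean is $0.40$ as stated in the Figure 2 caption, while the per-sequence variance depends on which component is drawn.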

Theorems & Definitions (9)

Theorem 3.1
Remark 3.2
Example 3.3
Proposition 3.4
Example 3.5
Proposition 3.6
Example 3.7
Lemma 1.3
proof
