Portfolio Reinforcement Learning with Scenario-Context Rollout

Vanya Priscillia Bendatu; Yao Lu

TL;DR

This work constructs a counterfactual next state from the rollout-implied continuations and uses it to augment the critic's bootstrap target, which stabilizes learning and provides a viable bias-variance tradeoff.
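To make the idea of an augmented bootstrap target concrete, here is a minimal sketch. It assumes the augmentation is a convex blend of the critic's value at the realized next state and at the counterfactual next state. The blend weight `mix`, the function name `blended_td_target`, and the convex-combination form itself are illustrative assumptions, not the paper's stated formula.

```python
def blended_td_target(reward, v_next_realized, v_next_counterfactual,
                      gamma=0.99, mix=0.5, done=False):
    """One-step critic target that blends two continuation values.

    ASSUMPTION: the blend is a convex combination with weight `mix`;
    the paper's exact augmentation is not given in this summary.
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")

    # Terminal transitions carry no continuation value.
    if done:
        return float(reward)

    # mix = 0 recovers the usual target built on the realized next state;
    # mix = 1 bootstraps only from the counterfactual next state.
    bootstrap = (1.0 - mix) * v_next_realized + mix * v_next_counterfactual
    return float(reward) + gamma * bootstrap


# Usage: scenario reward 1.0, realized value 2.0, counterfactual value 4.0.
target = blended_td_target(1.0, 2.0, 4.0, gamma=0.5, mix=0.25)
```

In this sketch `mix` is the knob behind the bias-variance tradeoff: it sets how much of the target's continuation comes from the rollout-implied state rather than from the logged one.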

Abstract

Market regime shifts induce distribution shifts that can degrade the performance of portfolio rebalancing policies. We propose macro-conditioned scenario-context rollout (SCR), which generates plausible next-day multivariate return scenarios under stress events. However, this raises a new challenge: the historical record never reveals what would have happened under a different scenario. As a result, incorporating scenario-based rewards from rollouts introduces a reward-transition mismatch in temporal-difference learning, which destabilizes RL critic training. We analyze this inconsistency and show that it leads to a mixed evaluation target. Guided by this analysis, we construct a counterfactual next state using the rollout-implied continuations and use it to augment the critic's bootstrap target. Doing so stabilizes learning and provides a viable bias-variance tradeoff. In out-of-sample evaluations across 31 distinct universes of U.S. equity and ETF portfolios, our method improves Sharpe ratio by up to 76% and reduces maximum drawdown by up to 53% compared with classic and RL-based portfolio rebalancing baselines.
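The two headline metrics can be computed from a series of daily portfolio returns as sketched below. The conventions here are assumptions rather than the paper's stated setup: annualization with 252 trading days, a zero risk-free rate by default, the sample standard deviation, and drawdown measured on a wealth curve rebased to 1. The function names are illustrative.

```python
import math
import statistics


def sharpe_ratio(daily_returns, periods_per_year=252, risk_free_daily=0.0):
    """Annualized Sharpe ratio from simple daily returns.

    ASSUMPTION: 252 periods per year and sample standard deviation.
    """
    excess = [r - risk_free_daily for r in daily_returns]
    if len(excess) < 2:
        raise ValueError("need at least two returns")
    spread = statistics.stdev(excess)
    if spread == 0.0:
        raise ValueError("returns have zero variance")
    return statistics.mean(excess) / spread * math.sqrt(periods_per_year)


def max_drawdown(daily_returns):
    """Largest peak-to-trough loss of the wealth curve, as a fraction."""
    wealth = 1.0   # wealth rebased to 1 at the start
    peak = 1.0     # running maximum of wealth
    worst = 0.0    # largest drawdown seen so far
    for r in daily_returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = max(worst, 1.0 - wealth / peak)
    return worst


# Usage: a gain, a sharp loss, then a partial recovery.
returns = [0.10, -0.50, 0.20]
drawdown = max_drawdown(returns)
sharpe = sharpe_ratio(returns)
```

Under these conventions, a statement such as "reduces maximum drawdown by up to 53%" would be read as a relative change in `max_drawdown` between a baseline and the proposed method over the same evaluation window.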

Paper Structure (17 sections, 5 theorems, 60 equations, 4 figures, 4 tables, 1 algorithm)


Introduction
Method
Scenario-Context Rollout (SCR)
Counterfactual Continuation for Critic Target Augmentation
Continuation Mismatch and Fixes
Experiments
Experimental Setup
Evaluation on Out-of-Sample Regimes
Ablations and Scenario-to-Real Analysis
Sensitivity Analysis
Conclusion
Additional Proofs
Proof of Lemma \ref{lem:reward_cancels}
Proof of Proposition \ref{prop:op_gap}
Proof of Corollary \ref{cor:fixed_point_bias}
...and 2 more sections

Key Result

Lemma 3.1

Fix $\phi$ and take any bounded $V$. We define the function Then for any policy $\pi$,

Figures (4)

Figure 1: Overview of our RL paradigm. SCR produces a conditional scenario distribution over next-day joint return vectors. The critic is trained using a bootstrap target that augments the realized continuation bootstrap with a counterfactual continuation.
Figure 2: Stress tests under market shocks. Wealth is rebased to 1 at the start of each window. Top: COVID-19 sell-off window. Bottom: 2021–2022 macro shock window.
Figure 3: Scenario-to-real validation. Cumulative average daily return under SCR scenario scoring versus realized tape returns.
Figure 4: Critic stability under logged-tape mismatch. Bellman residual $\mathrm{resid}_{\ell_2}$ versus training progress (mean $\pm$ 95% CI). Compared to PPO (Historical Replay), SCR--PPO--Full attains smaller residuals and a more stable critic.

Theorems & Definitions (12)

Definition 2.1: Quantile-normalized regime context $g_t$
Lemma 3.1
Proposition 3.2
Corollary 3.3
Theorem 3.5: One-step mixing bound (Wasserstein form)
Corollary 3.6
Definition A.1: Induced continuation distributions
proof
proof
proof
...and 2 more
