Portfolio Reinforcement Learning with Scenario-Context Rollout

Vanya Priscillia Bendatu, Yao Lu

TL;DR

This work constructs a counterfactual next state from the rollout-implied continuations and uses it to augment the critic's bootstrap target, which stabilizes learning and provides a viable bias-variance tradeoff.

Abstract

Market regime shifts induce distribution shifts that can degrade the performance of portfolio rebalancing policies. We propose macro-conditioned scenario-context rollout (SCR), which generates plausible next-day multivariate return scenarios under stress events. This raises a new challenge: history never reveals what would have happened under a different scenario. As a result, incorporating scenario-based rewards from rollouts introduces a reward-transition mismatch in temporal-difference learning, which destabilizes RL critic training. We analyze this inconsistency and show that it leads to a mixed evaluation target. Guided by this analysis, we construct a counterfactual next state from the rollout-implied continuations and use it to augment the critic's bootstrap target. Doing so stabilizes learning and provides a viable bias-variance tradeoff. In out-of-sample evaluations across 31 distinct universes of U.S. equity and ETF portfolios, our method improves the Sharpe ratio by up to 76% and reduces maximum drawdown by up to 53% compared with classic and RL-based portfolio rebalancing baselines.
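To make the idea of an augmented bootstrap target concrete, the following is a minimal sketch rather than the paper's actual formulation. It assumes the augmentation is a convex blend of the critic's value at the realized next state and at the counterfactual next state. The function name `augmented_td_target` and the weight `mix` are hypothetical and do not appear in the source.

```python
import numpy as np

def augmented_td_target(reward, v_next_realized, v_next_counterfactual,
                        gamma=0.99, mix=0.5):
    """One-step TD target with a blended bootstrap term (illustrative only).

    reward                : scenario-based reward(s) for the transition
    v_next_realized       : critic value at the realized (logged) next state
    v_next_counterfactual : critic value at the rollout-implied next state
    mix                   : hypothetical blend weight in [0, 1]; mix=0 gives
                            the standard target that bootstraps only from
                            the realized next state
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    reward = np.asarray(reward, dtype=float)
    v_real = np.asarray(v_next_realized, dtype=float)
    v_cf = np.asarray(v_next_counterfactual, dtype=float)
    # Assumed form: convex combination of the two continuation values.
    bootstrap = (1.0 - mix) * v_real + mix * v_cf
    return reward + gamma * bootstrap
```

Under this assumed form, `mix` trades off the two continuations: a larger weight on the counterfactual continuation keeps the bootstrap term consistent with the scenario-based reward, while a smaller weight stays closer to the logged data.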

Paper Structure

This paper contains 17 sections, 5 theorems, 60 equations, 4 figures, 4 tables, and 1 algorithm.

Key Result

Lemma 3.1

Fix $\phi$ and take any bounded $V$. We define the function Then for any policy $\pi$,

Figures (4)

  • Figure 1: Overview of our RL paradigm. SCR produces a conditional scenario distribution over next-day joint return vectors. The critic is trained using a bootstrap target that augments the realized continuation bootstrap with a counterfactual continuation.
  • Figure 2: Stress tests under shocks. Wealth curves are rebased to 1 at the window start. Top: COVID-19 sell-off window. Bottom: 2021–2022 macro shock window.
  • Figure 3: Scenario-to-real validation. Cumulative average daily return under SCR scenario scoring versus realized tape returns.
  • Figure 4: Critic stability under logged-tape mismatch. Bellman residual $\mathrm{resid}_{\ell_2}$ versus training progress (mean $\pm$ 95% CI). Compared to PPO (Historical Replay), SCR-PPO-Full attains smaller residuals and improved critic stability.
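
The headline metrics (Sharpe ratio, maximum drawdown) and the rebased wealth curves of Figure 2 can be computed from a series of daily portfolio returns. The sketch below uses the standard textbook definitions and does not reproduce the paper's evaluation code. The zero risk-free rate, the 252-day annualization, and all function names are assumptions.

```python
import numpy as np

def rebase_wealth(daily_returns):
    """Wealth curve that starts at 1 at the window start."""
    r = np.asarray(daily_returns, dtype=float)
    return np.concatenate(([1.0], np.cumprod(1.0 + r)))

def max_drawdown(wealth):
    """Largest peak-to-trough loss, as a fraction of the running peak."""
    w = np.asarray(wealth, dtype=float)
    running_peak = np.maximum.accumulate(w)
    return float(np.max(1.0 - w / running_peak))

def sharpe_ratio(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    r = np.asarray(daily_returns, dtype=float)
    return float(np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1))
```

For example, `max_drawdown(rebase_wealth(returns))` gives the drawdown over a stress window once `returns` holds that window's daily returns.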

Theorems & Definitions (12)

  • Definition 2.1: Quantile-normalized regime context $g_t$
  • Lemma 3.1
  • Proposition 3.2
  • Corollary 3.3
  • Theorem 3.5: One-step mixing bound (Wasserstein form)
  • Corollary 3.6
  • Definition A.1: Induced continuation distributions
  • proof
  • proof
  • proof
  • ...and 2 more