Regret-Optimized Portfolio Enhancement through Deep Reinforcement Learning and Future Looking Rewards

Daniil Karzanov, Rubén Garzón, Mikhail Terekhov, Caglar Gulcehre, Thomas Raffinot, Marcin Detyniecki

TL;DR

This work improves a traditional 60/40 portfolio through dynamic rebalancing driven by Proximal Policy Optimization (PPO), using a novel future-looking regret reward and an Oracle-guided action while accounting for transaction costs and training on synthetic data. It introduces a forward-looking Sharpe-based regret formulation, a transaction-cost scheduler, and circular block bootstrap synthetic data to improve generalization. Empirical results across three test phases show that the regret-based approach often yields higher returns and favorable risk metrics (e.g., maximum drawdown and Calmar/Sharpe ratios) compared to the static benchmark, with notable gains in the most recent phase. The study demonstrates the viability of regret-informed, transaction-aware PPO for adaptive portfolio management and points to future extensions such as action-space discretization and regime-aware training for crisis adaptation.
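For readers unfamiliar with the risk metrics named above, the following is a minimal sketch using their conventional textbook definitions. The function names, the 252-day annualization, and the zero risk-free rate are assumptions made for illustration; the paper's exact conventions are not stated in this summary.

```python
import numpy as np


def max_drawdown(values):
    """Largest peak-to-trough loss of a portfolio value series, as a fraction."""
    values = np.asarray(values, dtype=float)
    running_peak = np.maximum.accumulate(values)  # highest value seen so far
    drawdowns = 1.0 - values / running_peak       # fractional loss from that peak
    return float(drawdowns.max())


def sharpe_ratio(returns, periods_per_year=252, risk_free=0.0):
    """Annualized Sharpe ratio of per-period simple returns (assumed convention)."""
    excess = np.asarray(returns, dtype=float) - risk_free / periods_per_year
    return float(np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1))


def calmar_ratio(returns, periods_per_year=252):
    """Annualized compound return divided by maximum drawdown.

    Undefined (division by zero) when the series never draws down.
    """
    returns = np.asarray(returns, dtype=float)
    # Prepend the starting value 1.0 so a loss in the first period counts.
    values = np.concatenate([[1.0], np.cumprod(1.0 + returns)])
    annual_return = values[-1] ** (periods_per_year / len(returns)) - 1.0
    return float(annual_return / max_drawdown(values))
```

A value path of 100, 120, 90, 110 has a maximum drawdown of 25%, because the worst decline is from the peak of 120 to the trough of 90.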

Abstract

This paper introduces a novel agent-based approach for enhancing existing portfolio strategies using Proximal Policy Optimization (PPO). Rather than focusing solely on traditional portfolio construction, our approach aims to improve an already high-performing strategy through dynamic rebalancing driven by PPO and Oracle agents. Our target is to enhance the traditional 60/40 benchmark (60% stocks, 40% bonds) by employing the Regret-based Sharpe reward function. To address the impact of transaction fee frictions and prevent signal loss, we develop a transaction cost scheduler. We introduce a future-looking reward function and employ synthetic data training through a circular block bootstrap method to facilitate the learning of generalizable allocation strategies. We focus on two key evaluation measures: return and maximum drawdown. Given the high stochasticity of financial markets, we train 20 independent agents each period and evaluate their average performance against the benchmark. Our method not only enhances the performance of the existing portfolio strategy through strategic rebalancing but also demonstrates strong results compared to other baselines.
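The circular block bootstrap mentioned above can be sketched as follows. This is a generic illustration of the technique, not the authors' implementation: the function name `circular_block_bootstrap`, the fixed block length, and the array layout (rows are time steps, columns are assets) are all assumptions.

```python
import numpy as np


def circular_block_bootstrap(returns, block_size, n_samples=None, rng=None):
    """Resample a (T, n_assets) return matrix in contiguous, wrapping blocks.

    Whole rows are copied, so cross-asset dependence at each time step is
    preserved; contiguous blocks preserve short-range time dependence.
    """
    returns = np.asarray(returns)
    rng = np.random.default_rng() if rng is None else rng
    n_obs = returns.shape[0]
    n_samples = n_obs if n_samples is None else n_samples

    # Number of blocks needed to cover n_samples rows (ceiling division).
    n_blocks = -(-n_samples // block_size)

    # Each block may start anywhere; indices wrap around the end of the
    # series, which is what makes the bootstrap "circular".
    starts = rng.integers(0, n_obs, size=n_blocks)
    idx = (starts[:, None] + np.arange(block_size)[None, :]) % n_obs

    # Flatten block by block and trim the overshoot from the last block.
    return returns[idx.ravel()[:n_samples]]
```

Calling the function repeatedly with different seeds yields many synthetic return histories from one real history, for example `circular_block_bootstrap(real_returns, block_size=20, rng=np.random.default_rng(0))`, where `real_returns` and the block length of 20 are placeholders.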

Paper Structure

This paper contains 17 sections, 13 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: The evolution of accumulated (financial) return over training. Ablation study: specific model components are removed to assess their impact. TC: inclusion of the transaction cost schedule. BB: block bootstrap synthetic data. Regret: our reward function. Return: purely optimizing for returns, without including MDD or risk in the reward. Each configuration is averaged over 20 runs. The full configuration (TC + BB + Regret) generalizes better despite underperforming during training due to increased variability. Synthetic data functions as a form of regularization, preventing memorization of non-reproducible strategies. When used alone, BB and TC are less effective than when combined. However, the combination of TC, BB, and the Return reward function (which specifically optimizes the value displayed on the Y-axis of the plot) tends to overfit the training data and fails to generate profitable out-of-sample strategies.
  • Figure 2: Example of allocation during the testing period of phase 3. The regret-based agent demonstrates a general alignment with the optimal allocation but adopts a less aggressive stance. During bullish market periods, it increases its positions in risky assets, capitalizing on favorable conditions. Conversely, when market uncertainty arises, the agent shifts to more conservative allocations, showcasing its adaptability to changing market dynamics.
  • Figure 3: Distribution of returns and MDDs on the test periods for the PPO agent trained with the differential Sharpe, embedded drawdown, and regret (ours) reward functions, compared to the 60/40 benchmark (black dashed line). Our approach (in blue) focuses more on returns, allowing it to outperform all other methods in the primary objective across all three phases. In the latest trading phase, our model is superior to all considered approaches in both objectives.
  • Figure 4: Transaction cost schedule example.
  • Figure 5: Example of allocation and return dynamics by Diff. Sharpe (\ref{eq:diff_sharpe}) during the testing period of phase 3.
  • ...and 2 more figures