Finding Optimal Trading History in Reinforcement Learning for Stock Market Trading
Sina Montazeri, Haseebullah Jumakhan, Amir Mirzaeinia
TL;DR
The paper examines how the temporal window of input data affects CNN-based policies within a deep reinforcement learning framework for stock trading. It treats the observation window as a hyperparameter and tests an iterative window-expansion strategy from 2 to 12 weeks, using two feature arrangements and two FinRL-derived datasets that cover the same thirty Dow Jones companies. Short windows perform best when features are not rearranged per company, while longer windows become advantageous once features are grouped by company, which points to a strong interaction between data structuring and temporal context. The findings indicate that a CNN-DRL trading model can outperform benchmarks such as the Global X Guru ETF, with practical implications for hedge funds and high-frequency trading, while also highlighting the need for computational resources and careful data preprocessing.
Abstract
This paper investigates the optimization of temporal windows in financial Deep Reinforcement Learning (DRL) models that use 2D Convolutional Neural Networks (CNNs). We propose that the temporal field, the window of observations presented to the CNN policy, can and should be treated as a hyperparameter, and we examine its impact on model performance across datasets and feature arrangements. To measure its significance, we iteratively expand the observation window presented to the CNN policy during the deep reinforcement learning process, from two weeks to twelve weeks. This window expansion is carried out in two settings. In the first, we rearrange the features in the dataset to group them by company, so that the observation window and the CNN kernel have a full view of each company's data. In the second, features are not grouped by company and are instead arranged by category. Our study shows that shorter temporal windows are most effective when features are not grouped by company, whereas once the per-company rearrangement is introduced the model makes use of longer temporal windows and yields better performance. To check the consistency of these findings, we repeated the experiment on two datasets containing the same thirty companies from the Dow Jones Index but with different features in each, and observed the same patterns in both. The resulting trading model significantly outperforms benchmarks such as the Global X Guru ETF from the established asset manager Mirae Asset.
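
The two ingredients described in the abstract, per-company feature grouping and a variable-length observation window, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `company_major_order` and `observation_window`, the five-trading-day week, the one-week step between window sizes, and the assumption that the ungrouped dataset stores column `f * n_companies + c` for feature `f` of company `c` are all hypothetical choices made for illustration.

```python
import numpy as np

DAYS_PER_WEEK = 5  # assumption: five trading days per week


def company_major_order(n_companies: int, n_features: int) -> np.ndarray:
    """Column permutation from a category-major layout to a company-major one.

    Assumes input column f * n_companies + c holds feature f of company c.
    """
    idx = np.arange(n_companies * n_features).reshape(n_features, n_companies)
    return idx.T.reshape(-1)


def observation_window(data: np.ndarray, t: int, weeks: int) -> np.ndarray:
    """Return the rows for the `weeks` most recent weeks, ending at day t (inclusive)."""
    length = weeks * DAYS_PER_WEEK
    if t + 1 < length:
        raise ValueError("not enough history for the requested window")
    return data[t + 1 - length : t + 1]


# Toy data: 60 trading days, 3 companies, 2 features (hypothetical values).
n_days, n_companies, n_features = 60, 3, 2
data = np.arange(n_days * n_companies * n_features, dtype=float).reshape(n_days, -1)

order = company_major_order(n_companies, n_features)
grouped = data[:, order]  # columns of each company are now adjacent

# Sweep the window length as a hyperparameter, from 2 to 12 weeks
# (one-week steps are an assumption).
window_shapes = {
    weeks: observation_window(grouped, t=n_days - 1, weeks=weeks).shape
    for weeks in range(2, 13)
}
```

Each window is a 2D array of shape (days, columns) that a 2D CNN policy could consume as a single-channel image. In the grouped layout a kernel spanning adjacent columns sees several features of one company, whereas in the category layout it sees one feature across several companies.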
