Reinforcement Learning for Stock Transactions

Ziyi Zhou, Nicholas Stern, Julien Laasri

TL;DR

The paper investigates applying reinforcement learning to stock trading by formulating a time-window Markov Decision Process with actions {buy, wait} to determine optimal buy timing. It implements four agents (Baseline, Exact Q-Learning, Linear Approximate Q-Learning, and Deep Q-Learning) and augments the study with price-prediction experiments, evaluating on four tech stocks post-2005 with an 80/10/10 train/validation/test split. The agents do not consistently converge on a policy and profits vary widely, though approximate Q-learning (especially linear) and, to a lesser extent, deep Q-learning outperform the baseline on some stocks, which suggests potential when richer state representations or more data are used. The authors discuss limitations and propose future work including expanded state features (momentum, volume, indices), external data (news), and time-series models like LSTMs to improve robustness and convergence in real-world trading scenarios.

Abstract

Much research has been done to analyze the stock market. After all, anyone who can find a pattern in the chaotic frenzy of transactions could make a hefty profit by capitalizing on it. The goal of our project was therefore to apply reinforcement learning (RL) to determine the best time to buy a stock within a given time frame. With only a few adjustments, our model can be extended to identify the best time to sell a stock as well. To train the model on free, real-world data in the format in which it is available, we define our own Markov Decision Process (MDP) problem. Two papers [5] [6] helped us formulate the state space and the reward system of our MDP. We train a series of agents using Q-Learning, Q-Learning with linear function approximation, and deep Q-Learning. In addition, we try to predict stock prices using machine learning regression and classification models. We then compare our agents to see whether they converge on a policy and, if so, which one learned the best policy for maximizing profit on the stock market.
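
To make the setup concrete, the following is a minimal sketch of a buy/wait time-window MDP trained with tabular Q-learning, together with a chronological 80/10/10 split. It is an illustration under stated assumptions, not the authors' implementation: the state encoding (`state_of`), the reward (final price of the window minus the purchase price), the forced buy on the last day, and all hyperparameters are hypothetical choices made for this example.

```python
import random
from collections import defaultdict

WAIT, BUY = 0, 1


def chronological_split(prices):
    """Split a price series 80/10/10 in time order (no shuffling)."""
    n = len(prices)
    i, j = n * 8 // 10, n * 9 // 10
    return prices[:i], prices[i:j], prices[j:]


def state_of(window, t):
    """Hypothetical state: (day index, sign of the move since day 0)."""
    diff = window[t] - window[0]
    return (t, (diff > 0) - (diff < 0))


def run_window(window, q, epsilon=0.0, alpha=0.1, gamma=1.0, learn=False):
    """Play one time window and return the reward obtained.

    Assumed reward: final price of the window minus the purchase price,
    so buying on the last day yields 0. The agent is forced to buy on
    the last day if it has waited until then.
    """
    last = len(window) - 1
    for t in range(len(window)):
        s = state_of(window, t)
        if t == last:
            a = BUY  # assumption: a purchase must happen in every window
        elif random.random() < epsilon:
            a = random.choice((WAIT, BUY))  # explore
        else:
            a = BUY if q[(s, BUY)] > q[(s, WAIT)] else WAIT  # exploit
        if a == BUY:
            reward = window[last] - window[t]
            if learn:  # terminal transition: target is the reward itself
                q[(s, a)] += alpha * (reward - q[(s, a)])
            return reward
        if learn:  # WAIT: zero reward, bootstrap from the next day
            s2 = state_of(window, t + 1)
            if t + 1 == last:
                best_next = q[(s2, BUY)]
            else:
                best_next = max(q[(s2, WAIT)], q[(s2, BUY)])
            q[(s, a)] += alpha * (gamma * best_next - q[(s, a)])


def train(prices, window_len=5, episodes=500, epsilon=0.1, seed=0):
    """Tabular Q-learning over non-overlapping windows of the series."""
    random.seed(seed)
    q = defaultdict(float)
    starts = range(0, len(prices) - window_len + 1, window_len)
    windows = [prices[k:k + window_len] for k in starts]
    for _ in range(episodes):
        for w in windows:
            run_window(w, q, epsilon=epsilon, learn=True)
    return q
```

Calling `train` on the training portion returned by `chronological_split` and then `run_window(window, q)` with the default `epsilon=0.0` evaluates the learned greedy policy on a held-out window. The linear and deep variants mentioned in the abstract would replace the table `q` with a function approximator, which this sketch does not cover.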

Paper Structure

This paper contains 23 sections, 4 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: A visualization of the stock history data for the four companies we examined. The grey section is the part we cut out, the blue section is the training data, the red is the validation data, and the green is the testing data. From this plot one can see that the data we are training our agents on is similar across the four companies, and also across the train/val/test split.
  • Figure 2: A drawing of our Markov Decision Process. The initial state is marked with an I, while terminal states are marked with T's. A decision is made on each day, shown on the timeline at the bottom. The timeline represents a single time window.
  • Figure 3: Accuracy of our predictions on the test set for the regression formulation of the problem.
  • Figure 4: Accuracy of our predictions on the test set for the regression formulation of the problem.
  • Figure 5: Histograms of the results for each company.
  • ...and 1 more figure