Comparing Different Transformer Model Structures for Stock Prediction
Qizhao Chen
TL;DR
This paper investigates how Transformer structure affects stock index forecasting by comparing five Transformer variants on S&P 500 data across multiple sliding-window sizes and forecast horizons. It benchmarks these variants against traditional baselines (LSTM, TCN, SVR, RF). The results show that decoder-only Transformers consistently outperform the alternatives, while ProbSparse attention often underperforms; embedding-free variants can be competitive in some settings. The study provides practical guidance on selecting Transformer architectures for efficient and accurate time-series forecasting in finance.
Abstract
This paper compares different Transformer model architectures for stock index prediction. While many studies have shown that Transformers perform well in stock price forecasting, few have explored how different structural designs affect performance. Most existing works treat the Transformer as a black box, overlooking how specific architectural choices may affect predictive accuracy. Understanding these differences is critical for developing more effective forecasting models. To identify which Transformer variant is most suitable for stock forecasting, this study evaluates five Transformer structures: (1) encoder-only Transformer, (2) decoder-only Transformer, (3) Vanilla Transformer (encoder + decoder), (4) Vanilla Transformer without embedding layers, and (5) Vanilla Transformer with ProbSparse attention. Results show that Transformer-based models generally outperform traditional approaches. The decoder-only Transformer outperforms all other models in all scenarios, while the Transformer with ProbSparse attention performs worst in almost all cases.
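The comparison relies on sliding input-output windows over the index series. As a minimal sketch of how such samples might be built, the snippet below pairs each input window with the values that follow it. The function name `make_windows`, the parameters `input_len` and `horizon`, and the price values are illustrative assumptions, not the paper's actual code or data.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], input_len: int, horizon: int
) -> List[Tuple[List[float], List[float]]]:
    """Split a 1-D series into (input window, target window) pairs.

    Hypothetical helper: each sample uses `input_len` consecutive values
    as input and the next `horizon` values as the forecasting target.
    """
    if input_len <= 0 or horizon <= 0:
        raise ValueError("input_len and horizon must be positive")

    samples = []
    # Last start index that still leaves room for a full input and target.
    last_start = len(series) - input_len - horizon
    for start in range(last_start + 1):
        split = start + input_len
        x = list(series[start:split])
        y = list(series[split:split + horizon])
        samples.append((x, y))
    return samples


# Made-up closing values, for illustration only.
prices = [100.0, 101.5, 99.8, 102.3, 103.1, 104.0]
pairs = make_windows(prices, input_len=3, horizon=2)
```

With six values, an input length of 3, and a horizon of 2, this yields two samples. Varying `input_len` and `horizon` corresponds to the different window sizes and horizons over which the models are compared.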
