Integrating Inductive Biases in Transformers via Distillation for Financial Time Series Forecasting

Yu-Chen Den, Kuan-Yu Chen, Kendro Vincent, Darby Tien-Hao Chang

Abstract

Transformer-based models have been widely adopted for time-series forecasting due to their high representational capacity and architectural flexibility. However, many Transformer variants implicitly assume stationarity and stable temporal dynamics -- assumptions routinely violated in financial markets characterized by regime shifts and non-stationarity. Empirically, state-of-the-art time-series Transformers often underperform even vanilla Transformers on financial tasks, while simpler architectures with distinct inductive biases, such as CNNs and RNNs, can achieve stronger performance with substantially lower complexity. At the same time, no single inductive bias dominates across markets or regimes, suggesting that robust financial forecasting requires integrating complementary temporal priors. We propose TIPS (Transformer with Inductive Prior Synthesis), a knowledge distillation framework that synthesizes diverse inductive biases -- causality, locality, and periodicity -- within a unified Transformer. TIPS trains bias-specialized Transformer teachers via attention masking, then distills their knowledge into a single student model with regime-dependent alignment across inductive biases. Across four major equity markets, TIPS achieves state-of-the-art performance, outperforming strong ensemble baselines by 55%, 9%, and 16% in annual return, Sharpe ratio, and Calmar ratio, respectively, while requiring only 38% of the inference-time computation. Further analyses show that TIPS generates statistically significant excess returns beyond both vanilla Transformers and its teacher ensembles, and exhibits regime-dependent behavioral alignment with classical architectures during their profitable periods. These results highlight the importance of regime-dependent inductive bias utilization for robust generalization in non-stationary financial time series.
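
As a rough illustration of how attention masking can encode the three inductive biases named in the abstract, the sketch below builds boolean masks over a sequence of length `n`. The exact mask definitions used by TIPS are not given in this section, so the function names and the `window` and `period` parameters are hypothetical; the masks only show one plausible reading of causality, locality, and periodicity.

```python
# Illustrative only: these mask definitions are assumptions, not the
# masks specified by the TIPS authors.

def causal_mask(n):
    """True where query position i may attend to key position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def local_mask(n, window):
    """True where key j lies within `window` steps of query i (hypothetical parameter)."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]


def periodic_mask(n, period):
    """True where i and j are a whole number of `period` steps apart (hypothetical parameter)."""
    return [[(i - j) % period == 0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Print each mask for a short sequence; 1 marks an allowed attention link.
    for name, mask in [
        ("causal", causal_mask(4)),
        ("local", local_mask(4, 1)),
        ("periodic", periodic_mask(4, 2)),
    ]:
        print(name)
        for row in mask:
            print(" ".join("1" if allowed else "0" for allowed in row))
```

In a Transformer, a mask like this would typically be applied by setting disallowed attention logits to a large negative value before the softmax, so each teacher can only use the temporal relationships its bias permits.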

Paper Structure (69 sections, 12 equations, 6 figures, 12 tables, 1 algorithm)

Figures (6)

  • Figure 1: Performance–efficiency trade-off across generic time-series models, financial forecasting models, and classical architectures evaluated across multiple equity markets. The figure highlights substantial variation in performance and computational cost across model families, with TIPS achieving the strongest performance among the evaluated methods while maintaining low inference-time overhead.
  • Figure 2: Overview of the TIPS training framework. (a) Bias-specialized Transformer (TFM) teachers are constructed via different attention masks or positional biases (colors indicate where the masks and biases are applied). (b) Teachers are trained independently for ranking prediction. (c) Teacher predictions are averaged and distilled into a single student model.
  • Figure 3: CSI300 market regime segmentation.
  • Figure 4: CSI500 market regime segmentation.
  • Figure 5: NI225 market regime segmentation.
  • ...and 1 more figure
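
Step (c) in the Figure 2 caption, averaging teacher predictions and distilling them into one student, can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the mean-squared-error objective is a stand-in chosen for simplicity, since this section does not specify the distillation loss or how the regime-dependent alignment is computed.

```python
# Illustrative only: helper names and the MSE objective are assumptions,
# not the loss specified by the TIPS authors.

def average_teacher_scores(teacher_scores):
    """Average per-asset scores across teachers.

    `teacher_scores` is a list with one inner list of scores per teacher,
    all of the same length (one score per asset).
    """
    n_teachers = len(teacher_scores)
    return [sum(per_asset) / n_teachers for per_asset in zip(*teacher_scores)]


def distillation_mse(student_scores, target_scores):
    """Mean squared error between student scores and averaged teacher scores."""
    squared_errors = [(s - t) ** 2 for s, t in zip(student_scores, target_scores)]
    return sum(squared_errors) / len(target_scores)


if __name__ == "__main__":
    # Three hypothetical teachers scoring the same three assets.
    teachers = [
        [0.2, 0.5, 0.1],
        [0.4, 0.3, 0.3],
        [0.6, 0.1, 0.2],
    ]
    targets = average_teacher_scores(teachers)
    student = [0.3, 0.4, 0.2]
    print("averaged teacher targets:", targets)
    print("distillation loss:", distillation_mse(student, targets))
```

Because the student is trained against a single averaged target, inference needs only one forward pass rather than one per teacher, which is consistent with the lower inference-time cost the abstract reports relative to ensembles.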