DiffsFormer: A Diffusion Transformer on Stock Factor Augmentation

Yuan Gao, Haokun Chen, Xiang Wang, Zhicai Wang, Xue Wang, Jinyang Gao, Bolin Ding

TL;DR

This work tackles data scarcity in stock forecasting with DiffsFormer, a diffusion-based Transformer that augments stock factors with AI-generated samples. The model is trained on a large source domain and adapted to a target task through an editing step that uses $T'\ll T$ diffusion steps. It supports both predictor-guided and predictor-free conditional diffusion, and uses transfer learning to distill knowledge from broader markets, improving robustness to low SNR and data homogeneity. Experiments on CSI300 and CSI800 with eight forecasting models show consistent improvements in annualized return ratio and related metrics, while analyses reveal benefits in data fidelity, diversity, and reduced volatility through loss-guided diffusion and efficient sampling. The approach works as a plug-in augmentation module for existing forecasting backbones.

Abstract

Machine learning models have demonstrated remarkable efficacy and efficiency in a wide range of stock forecasting tasks. However, the inherent challenges of data scarcity, including low signal-to-noise ratio (SNR) and data homogeneity, pose significant obstacles to accurate forecasting. To address this issue, we propose a novel approach that utilizes artificial intelligence-generated samples (AIGS) to enhance the training procedures. In our work, we introduce the Diffusion Model to generate stock factors with a Transformer architecture (DiffsFormer). DiffsFormer is initially trained on a large-scale source domain, incorporating conditional guidance so as to capture the global joint distribution. When presented with a specific downstream task, we employ DiffsFormer to augment the training procedure by editing existing samples. This editing step allows us to control the strength of the editing process, determining the extent to which the generated data deviates from the target domain. To evaluate the effectiveness of DiffsFormer-augmented training, we conduct experiments on the CSI300 and CSI800 datasets, employing eight commonly used machine learning models. The proposed method achieves relative improvements of 7.2% and 27.8% in annualized return ratio for the respective datasets. Furthermore, we perform extensive experiments to gain insights into the functionality of DiffsFormer and its constituent components, elucidating how they address the challenges of data scarcity and enhance the overall model performance. Our research demonstrates the efficacy of leveraging AIGS and the DiffsFormer architecture to mitigate data scarcity in stock forecasting tasks.
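
To make the editing idea concrete, the following is a minimal sketch of how an existing factor vector could be noised for a limited number of steps and then denoised. It is not the authors' implementation. It assumes the standard DDPM closed-form forward process with a linear noise schedule, and `identity_denoiser` is a hypothetical stand-in for the trained DiffsFormer reverse step. The parameter `t_edit` plays the role of $T'$, and all function names are illustrative.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def corrupt(x0, t_edit, alpha_bars, rng):
    """Noise x0 up to step t_edit (1-indexed) using the closed-form forward process."""
    if not 1 <= t_edit <= len(alpha_bars):
        raise ValueError("t_edit must lie in [1, len(alpha_bars)]")
    a = alpha_bars[t_edit - 1]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


def edit_sample(x0, t_edit, alpha_bars, denoise_step, rng):
    """Corrupt an existing sample for t_edit steps, then run t_edit reverse steps."""
    x = corrupt(x0, t_edit, alpha_bars, rng)
    for t in range(t_edit, 0, -1):
        x = denoise_step(x, t)
    return x


def identity_denoiser(x, t):
    """Hypothetical placeholder for the trained model's reverse step."""
    return x


rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)          # T = 1000 is an assumed value
factors = [0.5, -1.2, 0.3]                  # toy factor vector
weak_edit = edit_sample(factors, 50, alpha_bars, identity_denoiser, rng)
strong_edit = edit_sample(factors, 800, alpha_bars, identity_denoiser, rng)
```

Under these assumptions, a small `t_edit` keeps most of the original signal (the weight $\sqrt{\bar\alpha_{T'}}$ stays close to 1), while a large `t_edit` lets the sample drift further from the original, which is the strength control described in the abstract.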

Paper Structure

This paper contains 40 sections, 12 equations, 13 figures, 9 tables, and 2 algorithms.

Figures (13)

  • Figure 1: (a) Pearson Correlation Coefficients between return ratio and stock factors are low. (b) Average number of large price drop stocks split by sectors. Stocks within the same industry sector tend to perform similarly.
  • Figure 2: An illustration of DiffsFormer. F refers to "factors", such as the open, close, lowest, and highest prices of a stock.
  • Figure 3: Illustration of the editing step.
  • Figure 4: The training and the editing topology.
  • Figure 5: The $R^{2}$ score between the generated and the original factors and label. $R^{2}$ score is the square of the Pearson Correlation. The blue bars represent the $R^{2}$ scores of 158 factors, while the red bar shows the $R^{2}$ score of the label.
  • ...and 8 more figures
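
The Figure 5 caption defines the $R^{2}$ score as the square of the Pearson correlation between generated and original values. As a small illustration of that definition, and not the authors' evaluation code, the sketch below computes it for one factor column. The function names and the toy numbers are assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def r2_score(original, generated):
    """R^2 as defined in the Figure 5 caption: the squared Pearson correlation."""
    return pearson(original, generated) ** 2


# Toy values for a single factor (illustrative only).
original_factor = [1.0, 2.0, 3.0]
generated_factor = [1.0, 3.0, 2.0]
score = r2_score(original_factor, generated_factor)  # 0.25 for these values
```

Because the correlation is squared, a perfectly anti-correlated generated factor would also score 1, so this metric measures the strength of the linear relationship rather than its sign.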