CSMD: Curated Multimodal Dataset for Chinese Stock Analysis

Yu Liu, Zhuoying Li, Ruifeng Yang, Fengran Mo, Cen Chen

TL;DR

CSMD addresses the lack of Chinese-language multimodal stock data by curating price data for major indices with aligned financial news, and by applying LLM-guided factor extraction to improve interpretability. The authors also provide LightQuant, a lightweight framework that unifies data processing and backtesting for rapid prototyping. Empirical results show that both single-modal and multimodal models trained on CSMD outperform the same models trained on existing datasets such as CMIN-CN on stock trend prediction, and backtesting reports practical trading metrics. The work thus offers a practical, scalable resource for multimodal forecasting in the Chinese stock market and lowers the barrier to entry for researchers and practitioners.

Abstract

The stock market is a complex and dynamic system, and it is non-trivial for researchers and practitioners to uncover underlying patterns and forecast stock movements. Existing studies of stock market analysis leverage various types of information to extract useful factors, and their results depend heavily on the quality of the data used. However, currently available resources are mainly based on the U.S. stock market and written in English, which makes them difficult to adapt to other countries. To address these issues, we propose CSMD, a multimodal dataset curated specifically for analyzing the Chinese stock market, with meticulous processing for validated quality. In addition, we develop LightQuant, a lightweight and user-friendly framework for researchers and practitioners with expertise in financial domains. Experimental results on our datasets and framework with various backbone models demonstrate their effectiveness compared with existing datasets. The datasets and code are publicly available at https://github.com/ECNU-CILAB/LightQuant.

Paper Structure

This paper contains 11 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: An LLM is applied to denoise news texts about Yili Co., Ltd. from November 22, 2023, yielding factors with high readability and interpretability. The extracted factors are consistent with the stock's actual movement on the subsequent trading day. Note that the original text is in Chinese and has been translated for illustrative purposes.
  • Figure 2: The overall framework of our LightQuant.