Real-Time Online Stock Forecasting Utilizing Integrated Quantitative and Qualitative Analysis

Sai Akash Bathini, Dagli Cihan

TL;DR

This work tackles real-time stock forecasting by fusing quantitative signals with qualitative sentiment data through a publicly available Huge-Stock-Dataset covering eight stocks plus the DJIA from 2018–2022. It combines 45 technical indicators, ~220 quantitative fundamental features, and rich qualitative signals derived from archives, news, and social media, enabling incremental online learning with low-latency data collection. A thorough evaluation of language-model approaches reveals DistilRoBERTa as a strong performer for finance-domain sentiment, with domain-adaptive pretraining further enhancing predictive signals, and Spearman correlations between sentiment and returns reaching substantial levels (e.g., up to $r_s \approx 0.63$ in some categories). The dataset and analyses offer a practical resource for researchers and practitioners to build real-time, sentiment-aware stock forecasting models and to benchmark multi-source, cross-domain data fusion in finance.

Abstract

The application of machine learning to finance has become commonplace, particularly in stock market forecasting. The stock market is highly volatile, and huge amounts of data are generated every minute globally. Extracting effective intelligence from this data is of critical importance. However, combining numerical stock data with qualitative text data can be challenging. In this work, we accomplish this by providing an unprecedented, publicly available dataset with technical and fundamental data and sentiment that we gathered from news archives, TV news captions, radio transcripts, tweets, daily financial newspapers, etc. The text data entries used for sentiment extraction total more than 1.4 million. The dataset consists of daily entries from January 2018 to December 2022 for eight companies representing diverse industrial sectors and the Dow Jones Industrial Average (DJIA) as a whole. Holistic fundamental and technical data is provided training-ready for model learning and deployment. Most importantly, the data generated can be used for incremental online learning with real-time data points retrieved daily, since no stagnant data was utilized. All the data was retrieved from APIs or self-designed, robust information retrieval technologies with extremely low latency and zero monetary cost. These adaptable technologies facilitate data extraction for any stock. Moreover, applying Spearman's rank correlation over real-time data to link stock returns with sentiment analysis has produced noteworthy results for the DJIA and the eight other stocks, achieving accuracy levels surpassing 60%. The dataset is made available at https://github.com/batking24/Huge-Stock-Dataset.
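
To make the correlation measure mentioned in the abstract concrete, the following is a minimal sketch of Spearman's rank correlation between a daily sentiment series and a daily return series. It is not the authors' pipeline: the function names `average_ranks` and `spearman`, and the toy sentiment and return values, are assumptions chosen purely for illustration. Ties are handled with average ranks, and the coefficient is computed as the Pearson correlation of the two rank vectors.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's r_s: Pearson correlation of the rank-transformed series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("r_s is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Hypothetical toy data (not taken from the dataset): one value per trading day.
daily_sentiment = [0.10, -0.30, 0.45, 0.05, -0.20, 0.60]
daily_return = [0.004, -0.012, 0.009, 0.011, -0.003, 0.015]

r_s = spearman(daily_sentiment, daily_return)
print(f"Spearman r_s = {r_s:.3f}")
```

Because the measure depends only on ranks, it captures any monotonic relationship between sentiment and returns rather than a strictly linear one. In practice, a library routine such as `scipy.stats.spearmanr` would typically be used instead of a hand-written version.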

Paper Structure (24 sections, 4 equations, 3 figures, 7 tables)