NIFTY Financial News Headlines Dataset

Raeid Saqur, Ken Kato, Nicholas Vinden, Frank Rudzicz

TL;DR

NIFTY introduces a public, headline-informed dataset for financial market forecasting with large language models, structured for both SFT (NIFTY-LM) and RLHF-style alignment (NIFTY-RL). It combines long-span news data with market context and concrete label definitions, enabling POMDP-based framing and regime-switching studies. The work provides baseline results on the stock movement (SM) task, along with embedding analyses showing that larger models yield richer semantic representations that improve downstream forecasting, with practical impact for research in finance-focused NLP and RL. By releasing data structures and prompts tailored to modern LLM frameworks, NIFTY facilitates rapid experimentation and cross-institutional benchmarking in financial NLP and decision-making under uncertainty.

Abstract

We introduce and make publicly available the NIFTY Financial News Headlines dataset, designed to facilitate and advance research in financial market forecasting using large language models (LLMs). The dataset comprises two distinct versions tailored to different modeling approaches: (i) NIFTY-LM, which targets supervised fine-tuning (SFT) of LLMs with an auto-regressive, causal language-modeling objective, and (ii) NIFTY-RL, formatted specifically for alignment methods, such as reinforcement learning from human feedback (RLHF), that align LLMs via rejection sampling and reward modeling. Each version provides curated, high-quality data incorporating comprehensive metadata, market indices, and deduplicated financial news headlines, systematically filtered and ranked to suit modern LLM frameworks. We also include experiments demonstrating some applications of the dataset, in tasks such as stock price movement prediction and in studying the role of LLM embeddings in information acquisition/richness. The NIFTY dataset, along with utilities (such as systematic truncation of a prompt's context length), is available on Hugging Face at https://huggingface.co/datasets/raeidsaqur/NIFTY.

Paper Structure (47 sections, 9 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: A snapshot of the 'news' key value on 2020-02-06, at the onset of the global coronavirus epidemic. The prompt for our $\pi_{LM}$ policy is composed of the task instruction (as a query prefix), the market context, and this news value, concatenated such that $x_p \leftarrow (x_{instruction}; x_{context}; x_{news})$. Red and green text colors convey negative and positive sentiment, respectively. The day's market-relevant news was dominated by negative sentiment.
  • Figure 2: Breaking down the instruction or prompt prefix, and market context components of a prompt, $x_p$.
  • Figure 3: (a-c): Visualizations of 2D t-SNE projections of embedded clusters (using HDBSCAN with a minimum cluster size of 10) for models GPT2-[SMALL, MEDIUM, LARGE]. Each datapoint is an embedding of a news headline with a location tag in [U.S., Europe, Asia, Middle East, Latin America]. Each colour is associated with a cluster of headlines. The purple background points belong to the outlier cluster. (d): Information gain added when clustering model embeddings together on the headline location task. Information gain increases with the number of model parameters. This pattern persists across model architectures: GPT2 models are shown in blue, BERT models in red, and T5 models in green.
  • Figure 4: Information Gain in Clustered Prompt Embeddings (IG-CluPE): A novel method of measuring an LLM's ability to capture rich semantic contextualization of a corpus of text prompts with corresponding classifications. Prompt embeddings are extracted from the outputs of the last hidden layer of transformer models to create an embedding space optimized for linear separability of points from each class. A model's ability to group points with similar features together is measured through t-SNE clustering and information gain.
  • Figure 5: Reduction in variance (a) and information gain (b-c) added when clustering model embeddings together on the market movement, location, and genre tasks. Multiple sizes of GPT2 (blue), T5 (green), and BERT (red) models are plotted, with trend lines showing that an increase in parameter count leads to a higher clustered reduction in variance and information gain. Strong correlations between parameter count and information gain are shown for all 3 model types in the location and genre tasks. In the market movement task, variance is reduced when parameter counts are increased for the GPT2 and BERT models, but not for T5 models. Although not shown in (a-c), because their parameter counts are undisclosed, OPENAI-LARGE outperformed OPENAI-SMALL in each task. All results are available in Table \ref{table:embedding_results}.
  • ...and 1 more figure
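
The information-gain measurement described in the Figure 3 and Figure 4 captions can be illustrated with a minimal sketch. It assumes the textbook definition of information gain (entropy of the class labels minus the size-weighted entropy of the labels within each cluster); the captions do not state the exact formula used in the paper, so this may differ from the authors' computation. The function names `entropy` and `information_gain` and the toy labels are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(class_labels, cluster_ids):
    """Label entropy minus the size-weighted label entropy within each cluster.

    Assumption: this is the standard definition, not necessarily the
    exact quantity reported in the paper.
    """
    if len(class_labels) != len(cluster_ids):
        raise ValueError("class_labels and cluster_ids must have equal length")
    # Group the class labels by the cluster each point was assigned to.
    clusters = {}
    for label, cid in zip(class_labels, cluster_ids):
        clusters.setdefault(cid, []).append(label)
    n = len(class_labels)
    conditional = sum(len(m) / n * entropy(m) for m in clusters.values())
    return entropy(class_labels) - conditional


# Toy example: four headlines with location tags and two cluster assignments.
locations = ["U.S.", "U.S.", "Asia", "Asia"]
pure_clusters = [0, 0, 1, 1]   # clusters align with the location tags
mixed_clusters = [0, 1, 0, 1]  # clusters carry no location information
gain_pure = information_gain(locations, pure_clusters)
gain_mixed = information_gain(locations, mixed_clusters)
```

Under this definition, a clustering that separates the classes perfectly recovers the full label entropy, while a clustering unrelated to the classes yields a gain of zero. In practice the cluster ids would come from a clustering step such as HDBSCAN applied to the model embeddings, as the Figure 3 caption describes.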