Contrastive Similarity Learning for Market Forecasting: The ContraSim Framework

Nicholas Vinden, Raeid Saqur, Zining Zhu, Frank Rudzicz

TL;DR

ContraSim presents a self-supervised framework that learns a semantically structured embedding space for daily financial headlines by generating augmented headlines with a continuous similarity score and training with Weighted Self-Supervised Contrastive Learning. The approach enables inter-day comparisons to find historical analogs and improves market-movement forecasting when combined with LLM-based representations. Empirical results show meaningful gains on NIFTY-SFT and IMDB datasets, along with improved information-density metrics indicating that the embedding space captures market-dynamics signals without using ground-truth labels for clustering. The work advances interpretable, semantically grounded text representations for financial forecasting, with potential applicability to multiple domains and real-time decision support for analysts.

Abstract

We introduce the Contrastive Similarity Space Embedding Algorithm (ContraSim), a novel framework for uncovering the global semantic relationships between daily financial headlines and market movements. ContraSim operates in two key stages: (I) Weighted Headline Augmentation, which generates augmented financial headlines along with a fine-grained semantic similarity score, and (II) Weighted Self-Supervised Contrastive Learning (WSSCL), an extended version of classical self-supervised contrastive learning that uses the similarity score to create a refined weighted embedding space. This embedding space clusters semantically similar headlines together, facilitating deeper market insights. Empirical results demonstrate that integrating ContraSim features into financial forecasting tasks improves classification accuracy from WSJ headlines by 7%. Moreover, using an information density analysis, we find that the similarity spaces constructed by ContraSim intrinsically cluster days with homogeneous market movement directions, indicating that ContraSim captures market dynamics independent of ground truth labels. Additionally, ContraSim enables the identification of historical news days that closely resemble the headlines of the current day, providing analysts with actionable insights to predict market trends by referencing analogous past events.
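
To make the idea behind stage (II) concrete, the following is a minimal sketch of a similarity-weighted contrastive loss. It is not the paper's WSSCL objective, which is not given in this summary. It assumes a soft-label cross-entropy over temperature-scaled cosine similarities, where each augmented headline's similarity score acts as its target weight. The function names, the `temperature` parameter, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_contrastive_loss(anchor, candidates, weights, temperature=0.1):
    """Soft-label cross-entropy over candidates (an assumed formulation).

    anchor     -- embedding of the original headline
    candidates -- embeddings of its augmented headlines
    weights    -- similarity scores in [0, 1], one per candidate
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    # Temperature-scaled cosine similarities act as logits.
    logits = [cosine(anchor, c) / temperature for c in candidates]

    # Numerically stable log-softmax.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    log_probs = [l - log_z for l in logits]

    # Candidates with higher similarity scores contribute more to the loss,
    # so the optimizer is pushed to place them closer to the anchor.
    return -sum((w / total) * lp for w, lp in zip(weights, log_probs))


# Toy example: one near, one orthogonal, one opposite candidate.
anchor = [1.0, 0.0]
candidates = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
aligned = weighted_contrastive_loss(anchor, candidates, [0.9, 0.5, 0.1])
misaligned = weighted_contrastive_loss(anchor, candidates, [0.1, 0.5, 0.9])
```

In this toy setup, `aligned` is smaller than `misaligned`: the loss is low when the geometry of the embedding space already agrees with the similarity scores, which is the behaviour a weighted contrastive objective is meant to reward.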

Paper Structure

This paper contains 39 sections, 7 equations, 3 figures, 11 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of our proposed Contrastive Similarity (ContraSim) embedding approach. In training, we use a LLaMA chat model to generate augmented financial news headlines with varying degrees of semantic similarity to the original. We then use a Weighted Self-Supervised Contrastive Learning (WSSCL) approach to create an embedding space that clusters semantically similar prompts closer together. In deployment, the embeddings from the similarity space can be used to (i) make better predictions on the direction of today's stock movement and (ii) find the financial news most similar to today's.
  • Figure 2: Distribution of similarity scores for augmented headlines across different augmentation actions. Each histogram represents the frequency distribution of similarity scores produced by the quality monitoring system for a specific augmentation type: (a) Negated Headlines, showing a concentration of scores in the low similarity range ($[0, 0.33]$); (b) Semantically-Shifted Headlines, with scores distributed in the mid-range ($[0.33, 0.66]$); and (c) Rephrased Headlines, exhibiting high similarity scores ($[0.66, 1.00]$). These distributions validate that the augmentations align with their intended semantic similarity thresholds.
  • Figure 3: Breakdown of a prompt, $x_p$, into its instruction (prompt prefix) and market context components.
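
The similarity ranges quoted in the Figure 2 caption can be expressed as a small quality check. The sketch below is illustrative only: the three ranges come from the caption, while the dictionary keys, function names, and sample scores are hypothetical and do not come from the paper's quality monitoring system.

```python
# Score ranges as stated in the Figure 2 caption.
# The keys are hypothetical labels for the three augmentation actions.
BANDS = {
    "negated": (0.0, 0.33),
    "semantically_shifted": (0.33, 0.66),
    "rephrased": (0.66, 1.00),
}


def in_intended_band(action, score):
    """Return True if a similarity score lies in the range for its action."""
    low, high = BANDS[action]
    return low <= score <= high


def fraction_in_band(samples):
    """Share of (action, score) pairs whose score matches the intended range."""
    if not samples:
        raise ValueError("samples must not be empty")
    hits = sum(in_intended_band(action, score) for action, score in samples)
    return hits / len(samples)


# Made-up scores for illustration; the third one falls outside its range.
samples = [
    ("negated", 0.1),
    ("rephrased", 0.9),
    ("semantically_shifted", 0.8),
    ("negated", 0.2),
]
share = fraction_in_band(samples)
```

A check of this kind mirrors what the histograms are meant to validate: that each augmentation action produces scores concentrated in its intended range.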