Financial Sentiment Analysis: Leveraging Actual and Synthetic Data for Supervised Fine-tuning

Abraham Atsiwo

TL;DR

The paper tackles finance-specific sentiment analysis by addressing limited domain data and the fixed context windows of general-purpose models. It introduces two strategies: BertNSP-finance, which extends context by concatenating sentences via next-sentence-prediction training, and finbert-lc, which blends real and synthetic data to maximize context for sentiment classification. Key findings show that NSP-pretraining yields high accuracy and macro-F1 on the financial phrasebank, while finbert-lc often outperforms FINBERT and LSTM depending on agreement levels, with GPT-4 augmentation providing additional gains. The work demonstrates that synthetic data and longer-context methods can achieve state-of-the-art results with substantially fewer trainable parameters, offering a practical pathway for domain-specific sentiment analysis in finance.

Abstract

The Efficient Market Hypothesis (EMH) highlights the importance of financial news in stock price movement. Financial news comes in the form of corporate announcements, news titles, and other forms of digital text. Sentiment analysis can be used to generate insights from financial news. General-purpose language models are too general for sentiment analysis in finance. Curated labeled data for fine-tuning general-purpose language models are scarce, and existing fine-tuned models for sentiment analysis in finance do not capture the maximum context width. We hypothesize that using actual and synthetic data can improve performance. We introduce BertNSP-finance to concatenate shorter financial sentences into longer financial sentences, and finbert-lc to determine sentiment from digital text. The results show improved accuracy and F1 score on the financial phrasebank data with $50\%$ and $100\%$ agreement levels.
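The results are reported in terms of accuracy and macro-F1. As a minimal sketch of how these two metrics are computed for three-class sentiment labels, the Python below uses a label set and toy predictions that are illustrative assumptions, not data or code from the paper.

```python
# Assumed three-class label set for financial sentiment (illustrative).
LABELS = ("negative", "neutral", "positive")


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example (hypothetical labels, not from the paper).
y_true = ["positive", "neutral", "negative", "neutral"]
y_pred = ["positive", "neutral", "neutral", "neutral"]

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Macro-F1 weights every class equally, so a missed minority class (here, the single negative example) lowers the score more than it lowers accuracy.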

Paper Structure

This paper contains 12 sections, 3 figures, 8 tables, 1 algorithm.

Figures (3)

  • Figure 1: Token distribution of the financial phrasebank dataset.
  • Figure 2: Token distribution of the concatenated financial phrasebank dataset.
  • Figure 3: Plots of test size vs. accuracy, loss, and macro-F1, grouped by model type (Vanilla BERT Small, Vanilla BERT Large, and BertNSP-finance (PBERT NSP)).

Theorems & Definitions (2)

  • Example 1: Neutral predicted as negative / positive
  • Example 2: Misclassified Next Sentence Prediction