Aligning LLMs with Human Instructions and Stock Market Feedback in Financial Sentiment Analysis

Zijie Zhao, Roy E. Welsch

TL;DR

This study introduces an adaptive retrieval-augmented framework for Large Language Models (LLMs) that aligns with human instructions through Instruction Tuning and incorporates market feedback to dynamically adjust weights across various knowledge sources within the Retrieval-Augmented Generation (RAG) module.

Abstract

Financial sentiment analysis is crucial for trading and investment decision-making. This study introduces an adaptive retrieval-augmented framework for Large Language Models (LLMs) that aligns with human instructions through Instruction Tuning and incorporates market feedback to dynamically adjust weights across various knowledge sources within the Retrieval-Augmented Generation (RAG) module. Building upon foundational models like LLaMA 2, we fine-tune a series of LLMs ranging from 7B to 70B in size, enriched with Instruction Tuning and RAG, and further optimized through direct feedback and Reinforcement Learning (RL)-based refinement methods applied to the source weights of RAG. Through extensive evaluation, we demonstrate that the sentiment outputs from our LLMs more accurately mirror the intrinsic sentiment of textual data, showcasing a 1% to 6% boost in accuracy and F1 score over existing state-of-the-art models and leading conversational AI systems. Moreover, the extracted sentiments are more indicative of the direction of stock price movements. In addition, we construct portfolios that yield a 3.61% higher Sharpe ratio than the S&P 500 baseline in bullish markets. These portfolios also demonstrate resilience in bearish markets, with a 5x reduction in return losses compared to those typically experienced by the S&P 500.
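
For readers unfamiliar with the Sharpe ratio mentioned in the abstract, it is conventionally defined as the mean excess return divided by the standard deviation of excess returns, often annualized. The following is a minimal sketch of that textbook definition, not the authors' evaluation code. The function name `sharpe_ratio`, the default risk-free rate of zero, and the 252-periods-per-year annualization factor are assumptions made for illustration; the abstract does not state the return frequency or risk-free rate used.

```python
import math
from statistics import mean, stdev


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from per-period simple returns.

    Assumptions (not taken from the paper): `risk_free_rate` is expressed
    per period, and 252 trading days per year is used for annualization.
    """
    excess = [r - risk_free_rate for r in returns]
    if len(excess) < 2:
        raise ValueError("need at least two return observations")
    volatility = stdev(excess)  # sample standard deviation
    if volatility == 0:
        raise ValueError("returns have zero volatility")
    return mean(excess) / volatility * math.sqrt(periods_per_year)


if __name__ == "__main__":
    # Hypothetical daily returns, purely illustrative.
    daily_returns = [0.01, -0.01, 0.02, 0.0]
    print(sharpe_ratio(daily_returns))
```

Comparing two strategies then amounts to computing this quantity for each return series over the same period and with the same assumptions.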

Paper Structure

This paper contains 24 sections, 4 equations, 6 figures, 5 tables, 3 algorithms.

Figures (6)

  • Figure 1: Workflow of financial sentiment analysis using LLMs.
  • Figure 2: Family of financial LLMs based on LLaMA 2. (a) LLaMA I, (b) LLaMA I-RAG, (c) LLaMA I-RAG-DF, (d) LLaMA I-RAG-RL.
  • Figure 3: The impact of model size on weighted F1 score across test datasets. Error bars, here and throughout the paper, represent the standard deviation calculated across ten independent experiments, each with a different random seed. The gray horizontal line indicates the weighted F1 score obtained by GPT-4 Turbo.
  • Figure 4: Weight distribution across different knowledge sources of RAG. The gray horizontal line represents the uniform initialization level (12.5%) for the eight evaluated knowledge sources.
  • Figure 5: Cumulative return curves of different investment strategies and S&P 500. Values are computed as the mean of ten independent training experiments, each with a different random seed.
  • ...and 1 more figure