Can AI Read Between The Lines? Benchmarking LLMs On Financial Nuance

Dominick Kubica, Dylan T. Gordon, Nanami Emura, Derleen Saini, Charlie Goldenberg

TL;DR

Problem: financial sentiment analysis is challenged by hedging and sector-specific jargon; Approach: benchmark LLMs against traditional NLP on a standardized finance dataset and apply to Microsoft earnings transcripts with business-line segmentation; Contributions: LLMs outperform traditional sentiment engines but struggle to reach 85% accuracy, and segment-level analysis reveals nuanced patterns that correlate with investor reactions; Impact: findings inform enterprise deployment, prompting improvements in transparency, structured-data handling, and collaboration with domain experts to extract actionable insights from financial text.

Abstract

As of 2025, Generative Artificial Intelligence (GenAI) has become a central tool for productivity across industries. Beyond text generation, GenAI now plays a critical role in coding, data analysis, and research workflows. As large language models (LLMs) continue to evolve, it is essential to assess the reliability and accuracy of their outputs, especially in specialized, high-stakes domains like finance. Most modern LLMs transform text into numerical vectors, which are used in operations such as cosine similarity searches to generate responses. However, this abstraction process can lead to misinterpretation of emotional tone, particularly in nuanced financial contexts. While LLMs generally excel at identifying sentiment in everyday language, they often struggle with the strategically ambiguous language found in earnings call transcripts. Financial disclosures frequently embed sentiment in hedged statements, forward-looking language, and industry-specific jargon, making them difficult for human analysts to interpret consistently, let alone for AI models. This paper presents findings from the Santa Clara Microsoft Practicum Project, led by Professor Charlie Goldenberg, which benchmarks the performance of Microsoft's Copilot, OpenAI's ChatGPT, Google's Gemini, and traditional machine learning models for sentiment analysis of financial text. Using Microsoft earnings call transcripts, the analysis assesses how well LLM-derived sentiment correlates with market sentiment and stock movements and evaluates the accuracy of model outputs. Prompt engineering techniques are also examined to improve sentiment analysis results. Visualizations of sentiment consistency are developed to evaluate alignment between tone and stock performance, with sentiment trends analyzed across Microsoft's lines of business to determine which segments exert the greatest influence.
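
The cosine similarity operation mentioned in the abstract can be illustrated with a minimal sketch. This is not code from the paper: the function name and the toy three-dimensional vectors are hypothetical, and real embeddings have hundreds or thousands of dimensions. The sketch only shows how similarity between two text vectors is measured by the angle between them.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; the values are made up for illustration only.
hedged_statement = [0.2, 0.7, 0.1]
plain_statement = [0.9, 0.1, 0.3]

print(cosine_similarity(hedged_statement, plain_statement))
```

A value near 1 means the two vectors point in nearly the same direction, a value near 0 means they are close to orthogonal. Two sentences with very different tone can still sit close together in such a space, which is one way the misreading of hedged language described above could arise.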

Paper Structure

This paper contains 6 sections and 4 figures.

Figures (4)

  • Figure 1: Overall Sentiment Analysis Performance (First 250 Rows)
  • Figure 2: Condensed Sentiment Accuracy Comparison (First 250 Rows)
  • Figure 3: Positive Sentiment by Business Line - ChatGPT
  • Figure 4: SHAP Beeswarm: Effect of Net Sentiment on Stock Prediction