Evaluating Large Language Models for Stance Detection on Financial Targets from SEC Filing Reports and Earnings Call Transcripts

Nikesh Gyawali, Doina Caragea, Alex Vasenkov, Cornelia Caragea

TL;DR

This work introduces a sentence-level corpus for stance detection focused on three core financial metrics: debt, earnings per share (EPS), and sales, and conducts a systematic evaluation of modern large language models (LLMs) using zero-shot, few-shot, and Chain-of-Thought prompting strategies.

Abstract

Financial narratives from U.S. Securities and Exchange Commission (SEC) filing reports and quarterly earnings call transcripts (ECTs) are very important for investors, auditors, and regulators. However, their length, financial jargon, and nuanced language make fine-grained analysis difficult. Prior sentiment analysis in the financial domain has required large, expensive labeled datasets, which makes detecting sentence-level stance towards specific financial targets challenging. In this work, we introduce a sentence-level corpus for stance detection focused on three core financial metrics: debt, earnings per share (EPS), and sales. The sentences were extracted from Form 10-K annual reports and ECTs, and labeled for stance (positive, negative, neutral) using the advanced ChatGPT-o3-pro model under rigorous human validation. Using this corpus, we conduct a systematic evaluation of modern large language models (LLMs) using zero-shot, few-shot, and Chain-of-Thought (CoT) prompting strategies. Our results show that few-shot prompting with CoT performs best compared to supervised baselines, and that LLM performance varies across the SEC and ECT datasets. Our findings highlight the practical viability of leveraging LLMs for target-specific stance detection in the financial domain without requiring extensive labeled data.

Paper Structure

This paper contains 25 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Zero-shot accuracy of models on the SEC and ECT datasets. The top row (Panel A) presents results with chain-of-thought (CoT) prompting, and the bottom row (Panel B) presents results without CoT prompting. Each condition is evaluated across three transcript-usage scenarios: (a) no transcript context, (b) full transcript context, and (c) summarized context.
  • Figure 2: Context usage scenarios across different models on two datasets. Few-shot classification accuracy is shown for the ECT dataset (top row) and the SEC dataset (bottom row) under three context-usage scenarios: (a) no context, (b) full context, and (c) summarized context, across four models. Columns, from left to right, represent GPT-4.1-Mini, Gemma3-27B, Llama3-70B, and Mistral-24B. The variable $k$ indicates the number of most similar examples with a chain-of-thought demonstration per class.
  • Figure 3: Few-shot performance of transcript usage scenarios with and without chain-of-thought (CoT) prompts. Accuracy is shown for four models: (a) GPT-4.1-mini, (b) LLaMA3.3:70B, (c) Gemma3:27B, and (d) Mistral:24B. Results are averaged over the SEC and ECT datasets.
  • Figure 4: Few-shot with chain-of-thought accuracy on two datasets across various targets. Few-shot classification accuracy on the ECT dataset (top row) and the SEC dataset (bottom row) using chain-of-thought prompting for three targets—debt (left), EPS (centre), and sales (right). $k$ represents the number of most similar examples with a chain-of-thought demonstration per class. Error bars represent the standard deviation over three independent runs.
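
The captions for Figures 2 and 4 describe $k$ as the number of most similar examples, each with a chain-of-thought demonstration, per class. A minimal sketch of that selection step might look like the following. It is an illustration under stated assumptions, not the authors' implementation: the bag-of-words cosine similarity, the prompt wording, the function names, and the example pool are all hypothetical, since the similarity measure and prompt template are not given here.

```python
import math
import re
from collections import Counter

LABELS = ("positive", "negative", "neutral")


def bag_of_words(text):
    """Lowercased token counts; an assumed stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_demonstrations(query, pool, k):
    """Return the k pool examples most similar to the query for each label."""
    query_vec = bag_of_words(query)
    selected = []
    for label in LABELS:
        candidates = [ex for ex in pool if ex["stance"] == label]
        candidates.sort(
            key=lambda ex: cosine(query_vec, bag_of_words(ex["sentence"])),
            reverse=True,
        )
        selected.extend(candidates[:k])
    return selected


def build_prompt(query, target, demonstrations):
    """Assemble a few-shot CoT prompt (hypothetical wording)."""
    lines = [
        f"Classify the stance of each sentence towards the target '{target}' "
        "as positive, negative, or neutral. Reason step by step first."
    ]
    for ex in demonstrations:
        lines += [
            "",
            f"Sentence: {ex['sentence']}",
            f"Reasoning: {ex['rationale']}",
            f"Stance: {ex['stance']}",
        ]
    lines += ["", f"Sentence: {query}", "Reasoning:"]
    return "\n".join(lines)


# Invented illustrative pool; these sentences are NOT from the paper's corpus.
POOL = [
    {"sentence": "Net sales increased 12% compared with the prior year.",
     "stance": "positive", "rationale": "Sales grew year over year."},
    {"sentence": "We repaid a portion of our long-term debt.",
     "stance": "positive", "rationale": "Lower debt reduces leverage."},
    {"sentence": "Net sales decreased 8% due to weaker demand.",
     "stance": "negative", "rationale": "Sales fell year over year."},
    {"sentence": "Net sales are reported in millions of dollars.",
     "stance": "neutral", "rationale": "This only states a reporting unit."},
]

QUERY = "Net sales increased 5% this quarter."
DEMOS = select_demonstrations(QUERY, POOL, k=1)
PROMPT = build_prompt(QUERY, "sales", DEMOS)
```

With `k=1` the prompt carries one demonstration per stance class, so raising `k` grows the prompt by three demonstrations at a time. In practice a dense sentence encoder would likely replace `bag_of_words`, but that choice is outside what this page states.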