Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis

Md Talha Mohsin

TL;DR

This study examines how transformer-based LLMs perform when analyzing high-stakes financial disclosures. It evaluates five models on question answering over the Business sections of U.S. 10-K filings, using a framework that combines human judgments, automated similarity metrics, and behavioral diagnostics under standardized prompts. No model consistently dominates across the qualitative, lexical, semantic, and stability dimensions, and inter-rater reliability is generally low, which points to substantial subjectivity in financial interpretation. Apparent performance differences should therefore be read as tendencies under the tested conditions rather than as evidence of general reliability, and evaluation frameworks for finance applications need to account for human disagreement and interpretability. The work offers a replicable benchmark and guidance for deploying LLMs in finance, acknowledges its limitations, and proposes future directions: broader data, domain-expert evaluation, and causal interpretability. Key insights include the trade-off between lexical overlap and semantic alignment across models, the importance of cross-model consistency for longitudinal analyses, and the caution that surface-level similarity may not imply true interpretive alignment in financial contexts.

Abstract

Large language models (LLMs) are increasingly used to support the analysis of complex financial disclosures, yet their reliability, behavioral consistency, and transparency remain insufficiently understood in high-stakes settings. This paper presents a controlled evaluation of five transformer-based LLMs applied to question answering over the Business sections of U.S. 10-K filings. To capture complementary aspects of model behavior, we combine human evaluation, automated similarity metrics, and behavioral diagnostics under standardized and context-controlled prompting conditions. Human assessments indicate that models differ in their average performance across qualitative dimensions such as relevance, completeness, clarity, conciseness, and factual accuracy, though inter-rater agreement is modest, reflecting the subjective nature of these criteria. Automated metrics reveal systematic differences in lexical overlap and semantic similarity across models, while behavioral diagnostics highlight variation in response stability and cross-prompt alignment. Importantly, no single model consistently dominates across all evaluation perspectives. Together, these findings suggest that apparent performance differences should be interpreted as relative tendencies under the tested conditions rather than definitive indicators of general reliability. The results underscore the need for evaluation frameworks that account for human disagreement, behavioral variability, and interpretability when deploying LLMs in financially consequential applications.
