Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings

Yidong Jiang, Junrong Chen, Eftychia Makri, Jialin Chen, Peiwen Li, Ali Maatouk, Leandros Tassiulas, Eliot Brenner, Bing Xiang, Rex Ying

TL;DR

Fin-RATE introduces a real-world financial analytics benchmark built from SEC filings to evaluate LLMs across detailed within-disclosure reasoning, cross-company comparisons, and longitudinal tracking. The dataset comprises 15,311 document chunks from 2,472 filings, covering 43 companies across 36 industries from 2020–2025, with three QA tasks (DR-QA, EC-QA, LT-QA) and a rigorous five-stage quality-control process. Across 17 models and retrieval settings, results show substantial performance degradation when tasks require inter-document synthesis and temporal alignment, with end-to-end RAG performance heavily limited by retrieval quality rather than generation. The work also introduces a hierarchical retrieval approach that substantially improves evidence coverage and ranking, and it presents a fine-grained error taxonomy to diagnose failure modes, highlighting the fragility of finance-tuned models on cross-entity reasoning and the importance of long-context grounding for practical financial analysis.

Abstract

With the growing deployment of Large Language Models (LLMs) in the finance domain, these models are increasingly expected to parse complex regulatory disclosures. However, existing benchmarks often focus on isolated details, failing to reflect the complexity of professional analysis, which requires synthesizing information across multiple documents, reporting periods, and corporate entities. Nor do they distinguish whether errors stem from retrieval failures, generation flaws, finance-specific reasoning mistakes, or misunderstanding of the query or context, which makes it difficult to pinpoint performance bottlenecks. To bridge these gaps, we introduce Fin-RATE, a benchmark built on U.S. Securities and Exchange Commission (SEC) filings that mirrors financial analyst workflows through three pathways: detail-oriented reasoning within individual disclosures, cross-entity comparison under shared topics, and longitudinal tracking of the same firm across reporting periods. We benchmark 17 leading LLMs, spanning open-source, closed-source, and finance-specialized models, under both ground-truth context and retrieval-augmented settings. Results show substantial performance degradation, with accuracy dropping by 18.60% and 14.35% as tasks shift from single-document reasoning to longitudinal and cross-entity analysis. This degradation is driven by rising comparison hallucinations and time and entity mismatches, and is mirrored by declines in reasoning and factuality, limitations that prior benchmarks have yet to formally categorize or quantify.

Paper Structure

This paper contains 77 sections, 3 equations, 14 figures, and 22 tables.

Figures (14)

  • Figure 1: An overview of Fin-RATE with core tasks and corresponding evaluated capabilities.
  • Figure 2: Overview of the Fin-RATE dataset construction framework. Stage (1) builds a heterogeneous corpus from SEC filings via systematic collection, parsing, and chunking. Stage (2) generates and refines QA pairs through task-specific prompting, multi-step filtering (regex and LLM-based), human validation, and scoring point extraction.
  • Figure 3: Distribution of QA error subtypes across the three tasks. Left: Level 1 error types, covering broad categories B1–B4, C1–C4, and D1–D2. Right: Level 2 refinements of B1 (Hallucination) and D1 (Query Misunderstanding).
  • Figure 4: Radar charts showing model categories' performance across DR-QA, EC-QA, and LT-QA, with solid lines for mean and shaded areas for standard deviation. S1 = Reasoning Chain, S2 = Information Coverage, S3 = Analytical Depth, S4 = Clarity of Expression, S5 = Factual Consistency.
  • Figure 5: Retrieval error distribution. ME = Missing Evidence, SF = Sorting Failure, DE = Distractor Evidence.
  • ...and 9 more figures