FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Jian-Yun Nie

TL;DR

FinAuditing tackles the challenge of auditing complex GAAP and XBRL documents by introducing the first taxonomy-aligned, structure-aware benchmark built from real US-GAAP filings. It defines three complementary subtasks (FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency) within a unified evaluation framework. Zero-shot experiments across 13 SOTA LLMs reveal substantial gaps in semantic retrieval, hierarchical understanding, and multi-step numerical reasoning over interconnected documents, underscoring limitations in current models. The authors release the dataset and evaluation code on Hugging Face, providing a foundation for developing trustworthy, structure-aware financial intelligence systems.

Abstract

The complexity of the Generally Accepted Accounting Principles (GAAP) and the hierarchical structure of eXtensible Business Reporting Language (XBRL) filings make financial auditing increasingly difficult to automate and verify. While large language models (LLMs) have demonstrated strong capabilities in unstructured text understanding, their ability to reason over structured, interdependent, and taxonomy-driven financial documents remains largely unexplored. To fill this gap, we introduce FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating LLMs on financial auditing tasks. Built from real US-GAAP-compliant XBRL filings, FinAuditing defines three complementary subtasks, FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency, each targeting a distinct aspect of structured auditing reasoning. We further propose a unified evaluation framework integrating retrieval, classification, and reasoning metrics across these subtasks. Extensive zero-shot experiments on 13 state-of-the-art LLMs reveal that current models perform inconsistently across semantic, relational, and mathematical dimensions, with accuracy drops of up to 60-90% when reasoning over hierarchical multi-document structures. Our findings expose the systematic limitations of modern LLMs in taxonomy-grounded financial reasoning and establish FinAuditing as a foundation for developing trustworthy, structure-aware, and regulation-aligned financial intelligence systems. The benchmark dataset is available at Hugging Face.

Paper Structure

This paper contains 41 sections, 16 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Trends in the proportion of financial restatements among US companies from 2014 to 2024, categorized by reissuance and revision restatements. Data adapted from the Financial Times report, "Accounting errors force US companies to pull statements in record numbers" (Dec 9, 2024).
  • Figure 2: Overview of the FinAuditing benchmark framework for evaluating error detection on XBRL filings across three tasks.
  • Figure 3: The F1 score (%) for each relation type in the zero-shot setting on the FinRE task.
  • Figure 4: The accuracy (%) in the zero-shot setting on the FinMR task.
  • Figure 5: The error-rate results (%) for the FinMR task, where SER denotes the structural error rate, EER represents the extraction error rate, and CER refers to the calculation error rate.