EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements

Issa Sugiura, Takashi Ishida, Taro Makino, Chieko Tazuke, Takanori Nakagawa, Kosuke Nakago, David Ha

TL;DR

The experiments show that even state-of-the-art LLMs struggle in the financial domain, performing only marginally better than logistic regression on binary classification tasks such as fraud detection and earnings forecasting. The results highlight the need for benchmark frameworks that better reflect the environments in which financial professionals operate.

Abstract

Large Language Models (LLMs) have made remarkable progress, surpassing human performance on several benchmarks in domains such as mathematics and coding. A key driver of this progress has been the development of benchmark datasets. In contrast, the financial domain poses higher entry barriers due to its demand for specialized expertise, and benchmarks remain relatively scarce compared to those in mathematics or coding. We introduce EDINET-Bench, an open-source Japanese financial benchmark designed to evaluate LLMs on challenging tasks such as accounting fraud detection, earnings forecasting, and industry classification. EDINET-Bench is constructed from ten years of annual reports filed by Japanese companies. These tasks require models to process entire annual reports and integrate information across multiple tables and textual sections, demanding expert-level reasoning that is challenging even for human professionals. Our experiments show that even state-of-the-art LLMs struggle in this domain, performing only marginally better than logistic regression in binary classification tasks such as fraud detection and earnings forecasting. Our results show that simply providing reports to LLMs in a straightforward setting is not enough. This highlights the need for benchmark frameworks that better reflect the environments in which financial professionals operate, with richer scaffolding such as realistic simulations and task-specific reasoning support to enable more effective problem solving. We make our dataset and code publicly available to support future research.

Paper Structure

This paper contains 37 sections, 14 figures, and 9 tables.

Figures (14)

  • Figure 1: EDINET-Bench is a challenging benchmark evaluating LLMs on fraud detection, earnings forecasting, and industry classification from annual reports with text and tables.
  • Figure 2: Number of annual reports per fiscal year.
  • Figure 3: Prompt for accounting fraud detection.
  • Figure 4: Performance per fiscal first year on accounting fraud detection.
  • Figure 5: Prompt for earnings forecasting.
  • ...and 9 more figures