Towards Competent AI for Fundamental Analysis in Finance: A Benchmark Dataset and Evaluation
Zonghan Wu, Congyuan Zou, Junlin Wang, Chenhan Wang, Hangjing Yang, Yilei Shao
TL;DR
FinAR-Bench addresses the critical need for reliable benchmarking of large language models in financial fundamental analysis by focusing on financial statement analysis through three concrete subtasks: information extraction, indicator computation, and logical reasoning. The benchmark provides a dataset of 1,170 tasks across 100 Shanghai Stock Exchange companies (2022–2023) in two formats (XBRL and PDF) and introduces a structured evaluation framework combining RMS-based table assessment with tournament-style reasoning evaluation and an LLM-based judge. Experiments across 14 LLMs reveal strong performance in information extraction for large models, substantial gaps in precise numeric calculations for indicator computation, and variable but often sound reasoning, especially when prompts are augmented with explicit calculation knowledge. The results, along with a PDF- vs. text-format comparison, highlight practical challenges in deploying LLMs for real-world financial analysis and underscore the need for improved layout-aware modeling, numerical reasoning, and domain-specific prompting, providing a foundation for future open benchmarks and agent-based evaluation in finance.
Abstract
Generative AI, particularly large language models (LLMs), is beginning to transform the financial industry by automating tasks and helping to make sense of complex financial information. One especially promising use case is the automatic creation of fundamental analysis reports, which are essential for making informed investment decisions, evaluating credit risks, and guiding corporate mergers. Although LLMs can attempt to generate these reports from a single prompt, the risk of inaccuracy is significant. Poor analysis can lead to misguided investments, regulatory issues, and loss of trust. Existing financial benchmarks mainly evaluate how well LLMs answer financial questions but do not reflect performance in real-world tasks like generating financial analysis reports. In this paper, we propose FinAR-Bench, a solid benchmark dataset focusing on financial statement analysis, a core competence of fundamental analysis. To make the evaluation more precise and reliable, we break this task into three measurable steps: extracting key information, calculating financial indicators, and applying logical reasoning. This structured approach allows us to objectively assess how well LLMs perform each step of the process. Our findings offer a clear understanding of LLMs' current strengths and limitations in fundamental analysis and provide a more practical way to benchmark their performance in real-world financial settings.
