FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models

Shu Liu; Shangqing Zhao; Chenghao Jia; Xinlin Zhuang; Zhaoguang Long; Jie Zhou; Aimin Zhou; Man Lan; Qingquan Wu; Chong Yang

TL;DR

FinDABench addresses the lack of benchmarks for the financial data analysis capabilities of large language models. It introduces a three-dimension taxonomy (Foundational, Reasoning, Technical) covering six sub-tasks (Numerical Calculation QA, Early Warning Analysis, Fin-report Fraud Detection, Fin-report2Markdown, ChartData2Insight, NL2ViSQL). The benchmark comprises 2,400 instances, spanning 800 Foundational, 1,300 Reasoning, and 400 Technical data points, and is used to evaluate 41 LLMs in zero-shot and few-shot settings. Even a state-of-the-art model such as GPT-4 averages only 32.37% in the zero-shot setting and 39.38% in the few-shot setting. Domain-specific fine-tuning provides notable gains, but many tasks remain challenging, especially those requiring data-centric reasoning and visualization. The work demonstrates the importance of finance-focused fine-tuning and data-centric evaluation for advancing LLM capabilities in financial data analysis, and it provides a benchmark and dataset framework to guide future research and development.
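As a rough illustration of the taxonomy and the reported GPT-4 averages, the sketch below encodes the three dimensions and six sub-tasks as a small data structure and computes the few-shot gain over zero-shot. The pairing of sub-tasks to dimensions is an assumption inferred from the order in which they are listed and from the abstract's descriptions; the names `Dimension`, `TAXONOMY`, `GPT4_AVERAGES`, `all_sub_tasks`, and `few_shot_gain` are hypothetical and do not come from the FinDABench code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One evaluation dimension and the sub-tasks grouped under it."""
    name: str
    sub_tasks: tuple[str, ...]


# Assumption: sub-tasks are paired with dimensions in the order they are
# listed in the summary; the summary itself does not state this mapping.
TAXONOMY = (
    Dimension("Foundational", ("Numerical Calculation QA", "Early Warning Analysis")),
    Dimension("Reasoning", ("Fin-report Fraud Detection", "Fin-report2Markdown")),
    Dimension("Technical", ("ChartData2Insight", "NL2ViSQL")),
)

# GPT-4 average scores (percent) as quoted in the summary.
GPT4_AVERAGES = {"zero-shot": 32.37, "few-shot": 39.38}


def all_sub_tasks(taxonomy: tuple[Dimension, ...]) -> list[str]:
    """Flatten the taxonomy into a single list of sub-task names."""
    return [task for dim in taxonomy for task in dim.sub_tasks]


def few_shot_gain(averages: dict[str, float]) -> float:
    """Few-shot minus zero-shot average, in percentage points."""
    return round(averages["few-shot"] - averages["zero-shot"], 2)


if __name__ == "__main__":
    for dim in TAXONOMY:
        print(f"{dim.name}: {', '.join(dim.sub_tasks)}")
    print(f"GPT-4 few-shot gain: {few_shot_gain(GPT4_AVERAGES)} points")
```

Under these figures, few-shot prompting lifts the GPT-4 average by roughly seven percentage points, which still leaves it well below 50%.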

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. However, their proficiency and reliability in the specialized domain of financial data analysis, particularly focusing on data-driven thinking, remain uncertain. To bridge this gap, we introduce FinDABench, a comprehensive benchmark designed to evaluate the financial data analysis capabilities of LLMs within this context. FinDABench assesses LLMs across three dimensions: 1) Foundational Ability, evaluating the models' ability to perform financial numerical calculation and corporate sentiment risk assessment; 2) Reasoning Ability, determining the models' ability to quickly comprehend textual information and analyze abnormal financial reports; and 3) Technical Skill, examining the models' use of technical knowledge to address real-world data analysis challenges involving analysis generation and chart visualization from multiple perspectives. We will release FinDABench and the evaluation scripts at https://github.com/cubenlp/BIBench. FinDABench aims to provide a measure for in-depth analysis of LLM abilities and foster the advancement of LLMs in the field of financial data analysis.

Paper Structure (26 sections, 1 equation, 13 figures, 6 tables)

Introduction
Related Work
Benchmarks for Large Language Models
Advancements in Large Language Models
FinDA Benchmark
Foundational Ability
Reasoning Ability
Technical Skill
Experiment
Dataset Statistics
Evaluation Metrics
Evaluated Models
Experiment Setting
Main Results
In-depth Analysis
...and 11 more sections

Figures (13)

Figure 1: The job skills and their corresponding task names required for financial analysts to manage daily work. Text highlighted in green denotes the standard capabilities of financial analysts.
Figure 2: FinDABench aims to provide a multi-faceted evaluation framework that mirrors the multifarious nature of financial data analysis tasks.
Figure 3: Data examples for the six sub-tasks of FinDABench, each including questions and answers with a unique identifier to facilitate differentiation. For the English version, please see the appendix.
Figure 4: The statistical information for each sub-task of FinDABench is as follows: (a) represents Numerical Calculation QA, (b) represents Early Warning Analysis, (c) represents Fin-Report Fraud Detection, (d) represents Fin-Report2Markdown, (e) represents ChartData2Insight, and (f) represents NL2VisQL.
Figure 5: Average performance (zero-shot) of 41 LLMs evaluated on FinDABench
...and 8 more figures
