FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models

Shu Liu, Shangqing Zhao, Chenghao Jia, Xinlin Zhuang, Zhaoguang Long, Jie Zhou, Aimin Zhou, Man Lan, Qingquan Wu, Chong Yang

TL;DR

FinDABench addresses the gap in assessing financial data analysis capabilities of large language models by introducing a three-dimension taxonomy (Foundational, Reasoning, Technical) and six sub-tasks (Numerical Calculation QA, Early Warning Analysis, Fin-report Fraud Detection, Fin-report2Markdown, ChartData2Insight, NL2ViSQL). The benchmark comprises 2,400 instances, spanning 800 Foundational, 1,300 Reasoning, and 400 Technical data points, and is used to evaluate 41 LLMs across zero-shot and few-shot settings. Results show that even state-of-the-art models like GPT-4 achieve only about 32.37% in zero-shot and 39.38% in few-shot averages, with domain-specific fine-tuning providing notable gains but many tasks remaining challenging, especially those requiring data-centric reasoning and visualization. The work demonstrates the importance of finance-focused fine-tuning and data-centric evaluation to advance LLM capabilities in financial data analysis, and it provides a benchmark and dataset framework to guide future research and development.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. However, their proficiency and reliability in the specialized domain of financial data analysis, particularly with respect to data-driven thinking, remain uncertain. To bridge this gap, we introduce FinDABench, a comprehensive benchmark designed to evaluate the financial data analysis capabilities of LLMs within this context. FinDABench assesses LLMs across three dimensions: 1) Foundational Ability, evaluating the models' ability to perform financial numerical calculation and corporate sentiment risk assessment; 2) Reasoning Ability, determining the models' ability to quickly comprehend textual information and analyze abnormal financial reports; and 3) Technical Skill, examining the models' use of technical knowledge to address real-world data analysis challenges involving analysis generation and chart visualization from multiple perspectives. We will release FinDABench and the evaluation scripts at https://github.com/cubenlp/BIBench. FinDABench aims to provide a measure for in-depth analysis of LLM abilities and foster the advancement of LLMs in the field of financial data analysis.

Paper Structure

This paper contains 26 sections, 1 equation, 13 figures, and 6 tables.

Figures (13)

  • Figure 1: The job skills and their corresponding task names required for financial analysts to manage daily work. Text highlighted in green denotes the standard capabilities of financial analysts.
  • Figure 2: FinDABench aims to provide a multi-faceted evaluation framework that mirrors the multifarious nature of financial data analysis tasks.
  • Figure 3: Data examples for the six sub-tasks of FinDABench, each including questions and answers with a unique identifier to facilitate differentiation. For the English version, please see the appendix.
  • Figure 4: The statistical information for each sub-task of FinDABench is as follows: (a) represents Numerical Calculation QA, (b) represents Early Warning Analysis, (c) represents Fin-Report Fraud Detection, (d) represents Fin-Report2Markdown, (e) represents ChartData2Insight, and (f) represents NL2VisQL.
  • Figure 5: Average performance (zero-shot) of 41 LLMs evaluated on FinDABench.
  • ...and 8 more figures