FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models

Xin Guo, Haotian Xia, Zhaowei Liu, Hanyang Cao, Zhi Yang, Zhiqiang Liu, Sizhe Wang, Jinyi Niu, Chuqi Wang, Yanhui Wang, Xiaolong Liang, Xiaoming Huang, Bing Zhu, Zhongyu Wei, Yun Chen, Weining Shen, Liwen Zhang

TL;DR

FinEval introduces a comprehensive benchmark for evaluating large language models on Chinese financial knowledge across four areas, with particular emphasis on financial security and financial agent tasks. Its 8,351 questions span academic, industry, security, and agent dimensions, and models are evaluated under zero-shot, five-shot, and chain-of-thought prompting using both objective and subjective metrics. Key findings show that the top closed-source and open-source models approach but do not match expert performance, that five-shot CoT prompting yields substantial gains, and that GPT-4o-based evaluation aids the assessment of complex agent tasks. The work provides a domain-aligned benchmark along with insights into current limitations and future directions for finance-focused LLM evaluation in the Chinese context.

Abstract

Large language models have demonstrated outstanding performance in various natural language processing tasks, but their security capabilities in the financial domain have not been explored, and their performance on complex tasks such as financial agent tasks remains unknown. This paper presents FinEval, a benchmark designed to evaluate LLMs' financial domain knowledge and practical abilities. The dataset contains 8,351 questions categorized into four key areas: Financial Academic Knowledge, Financial Industry Knowledge, Financial Security Knowledge, and Financial Agent. Financial Academic Knowledge comprises 4,661 multiple-choice questions spanning 34 subjects such as finance and economics. Financial Industry Knowledge contains 1,434 questions covering practical scenarios like investment research. Financial Security Knowledge assesses models through 1,640 questions on topics like application security and cryptography. Financial Agent evaluates tool usage and complex reasoning with 616 questions. FinEval supports multiple evaluation settings, including zero-shot and five-shot with chain-of-thought, and assesses model performance using objective and subjective criteria. Our results show that Claude 3.5-Sonnet achieves the highest weighted average score of 72.9 across all financial domain categories under the zero-shot setting. Our work provides a comprehensive benchmark closely aligned with the Chinese financial domain.
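The abstract reports a weighted average across the four categories but does not say how the weights are chosen. As a minimal sketch, assuming each category is weighted by its question count (an assumption, not something the abstract confirms), the aggregation could look like the following. The function name `weighted_average` and the per-category scores are hypothetical and serve only as an illustration.

```python
# Question counts per category, as reported in the abstract.
QUESTION_COUNTS = {
    "Financial Academic Knowledge": 4661,
    "Financial Industry Knowledge": 1434,
    "Financial Security Knowledge": 1640,
    "Financial Agent": 616,
}


def weighted_average(scores, counts=QUESTION_COUNTS):
    """Average per-category scores, weighting each by its question count.

    Assumption: weights are proportional to question counts. The paper may
    use a different weighting scheme.
    """
    if set(scores) != set(counts):
        raise ValueError("scores and counts must cover the same categories")
    total = sum(counts.values())
    return sum(scores[name] * counts[name] for name in counts) / total


# Hypothetical per-category scores, for illustration only.
# These are NOT results from the paper.
example_scores = {
    "Financial Academic Knowledge": 70.0,
    "Financial Industry Knowledge": 60.0,
    "Financial Security Knowledge": 80.0,
    "Financial Agent": 50.0,
}

if __name__ == "__main__":
    # The four category sizes add up to the reported dataset size.
    assert sum(QUESTION_COUNTS.values()) == 8351
    print(f"weighted average: {weighted_average(example_scores):.2f}")
```

Under this weighting, the Financial Academic Knowledge category, with more than half of the questions, dominates the aggregate, so a model's overall score would track its academic-knowledge score more closely than its agent score.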

Paper Structure

This paper contains 28 sections, 18 figures, and 24 tables.

Figures (18)

  • Figure 1: FinEval is divided into four parts: Financial Academic Knowledge, Financial Industry Knowledge, Financial Security Knowledge, and Financial Agent. The number of questions in each sub-dataset is indicated after the corresponding name.
  • Figure 2: Examples of financial security and financial agent questions. For better readability, the English translation is displayed below the corresponding Chinese text. Additional examples can be found in the Appendix.
  • Figure 3: Error analysis results for ten models. Each bar represents the proportion of a specific type of error among all errors made by a particular model. The values of the three bars for a model sum to 1, representing the total error distribution for that model.
  • Figure 4: Zero-shot example of multiple-choice questions in Intermediate Financial Accounting. For better readability, the English translation is displayed below the corresponding Chinese text.
  • Figure 5: An instance of five-shot evaluation. The red text denotes the response automatically generated by the model, with the preceding text being the input prompt. English translations for the related Chinese text are provided beneath.
  • ...and 13 more figures