CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models

Ying Nie, Binwei Yan, Tianyu Guo, Hao Liu, Haoyu Wang, Wei He, Binfan Zheng, Weihao Wang, Qiang Li, Weijian Sun, Yunhe Wang, Dacheng Tao

TL;DR

CFinBench addresses the need for a domain-specific evaluation benchmark for large language models in Chinese finance. It introduces a taxonomy of four first-level categories (Financial Subject, Financial Qualification, Financial Practice, and Financial Law) covering 99,100 questions across 43 second-level categories and three question types (single-choice, multiple-choice, and judgment). The dataset is cleaned, deduplicated with MinHash, rephrased by an LLM, has its answer options shuffled, and passes multiple rounds of human validation. It is used to evaluate 50 diverse LLMs with zero-shot and few-shot prompts on an OpenCompass-based inference setup. GPT-4 and some Chinese-oriented models lead, with the highest average accuracy at 60.16%, which indicates that the benchmark is challenging and leaves considerable room for improvement in the Chinese financial domain; the work also examines the impact of model size, domain-specific pre-training, and prompt strategies. The dataset and code are publicly available, enabling reproducible, domain-focused benchmarking of financial LLMs and guiding future work on domain-specific AI for finance.
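Two of the preparation steps named above, MinHash deduplication and option shuffling, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the character 3-gram shingling, the 64 hash seeds, and the 0.8 similarity threshold are all assumptions chosen for illustration, and the pairwise comparison is only practical for small collections.

```python
import hashlib
import random
import string
from typing import List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[str]:
    """Character k-grams (assumed choice; Chinese text has no word spaces)."""
    text = "".join(text.split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(text: str, num_hashes: int = 64) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # the seed selects the hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def deduplicate(questions: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a question only if it is not a near-duplicate of one already kept."""
    kept, kept_sigs = [], []
    for q in questions:
        sig = minhash_signature(q)
        if all(estimated_jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(q)
            kept_sigs.append(sig)
    return kept


def shuffle_options(options: List[str], answer: str,
                    rng: random.Random) -> Tuple[List[str], str]:
    """Permute the options and remap answer letters (e.g. "AC") to new positions."""
    labels = string.ascii_uppercase[:len(options)]
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    correct = {labels.index(ch) for ch in answer}
    new_answer = "".join(labels[pos] for pos, old in enumerate(order) if old in correct)
    return shuffled, new_answer


if __name__ == "__main__":
    unique = deduplicate(["下列哪项属于流动资产？", "下列哪项属于流动资产？"])
    print(len(unique), shuffle_options(["现金", "厂房", "存货"], "AC", random.Random(0)))
```

At the scale of 99,100 questions, a locality-sensitive hashing index over the signatures would normally replace the pairwise loop; the sketch keeps the loop only to make the similarity estimate visible.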

Abstract

Large language models (LLMs) have achieved remarkable performance on various NLP tasks, yet their potential in more challenging, domain-specific tasks, such as finance, has not been fully explored. In this paper, we present CFinBench, a meticulously crafted benchmark, the most comprehensive to date, for assessing the financial knowledge of LLMs in a Chinese context. To better align with the career trajectory of Chinese financial practitioners, we build a systematic evaluation from 4 first-level categories: (1) Financial Subject: whether LLMs can memorize the necessary basic knowledge of financial subjects, such as economics, statistics and auditing. (2) Financial Qualification: whether LLMs can obtain the required financial certifications, such as certified public accountant, securities qualification and banking qualification. (3) Financial Practice: whether LLMs can perform practical financial jobs, such as tax consultant, junior accountant and securities analyst. (4) Financial Law: whether LLMs can meet the requirements of financial laws and regulations, such as tax law, insurance law and economic law. CFinBench comprises 99,100 questions spanning 43 second-level categories with 3 question types: single-choice, multiple-choice and judgment. We conduct extensive experiments on CFinBench with 50 representative LLMs of various model sizes. The results show that GPT4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%, highlighting the challenge presented by CFinBench. The dataset and evaluation code are available at https://cfinbench.github.io/.

Paper Structure (33 sections, 6 figures, 8 tables)

This paper contains 33 sections, 6 figures, 8 tables.

Figures (6)

  • Figure 1: CFinBench comprises 4 first-level categories and 43 second-level categories, which are more aligned with the career trajectory of financial practitioners.
  • Figure 2: Examples of 3 types of questions in CFinBench. English translations are shown in blue for better readability.
  • Figure 3: Examples of question rephrasing. English translations are shown in blue for better readability. In each example, the top is the original question, and the bottom is the rephrased question.
  • Figure 4: Examples of zero-shot prompts in answer-only setting. English translations are shown in blue for better readability.
  • Figure 5: Examples of few-shot prompts in answer-only setting. English translations are shown in blue for better readability.
  • ...and 1 more figure