BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, Ji Liu

TL;DR

BizFinBench introduces a business-driven benchmark for real-world financial LLM evaluation and a novel IteraJudge framework to reduce evaluator bias. By collecting 6,781 Chinese queries across five dimensions and nine categories, the benchmark emphasizes task realism, contextual complexity, and adversarial robustness. The study benchmarks 25 LLMs, revealing that no model dominates across all tasks and highlighting strengths in proprietary systems for knowledge-intensive tasks and open models for certain reasoning challenges. IteraJudge demonstrably improves evaluation reliability, paving the way for more trustworthy, finance-focused LLM deployments in industry settings.

Abstract

Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.
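As a quick check on the spread claims above, the gaps can be recomputed from the scores quoted in the abstract. This is only a minimal sketch: the dictionary and the helper name `quoted_gaps` are illustrative, not part of the BizFinBench code. The pairs are the figures the abstract happens to quote, not the full per-dimension minimum and maximum across all 25 models.

```python
# Scores quoted in the abstract, stored as (higher score, lower score).
# Assumption: these are only the figures named in the abstract, not the
# true min/max over all 25 benchmarked models.
quoted_scores = {
    "Numerical Calculation": (64.04, 15.92),   # DeepSeek-R1 vs Qwen2.5-VL-3B
    "Information Extraction": (71.46, 11.23),  # DeepSeek-R1 vs Qwen3-1.7B
    "Prediction Recognition": (50.00, 39.16),  # range quoted for top models
}


def quoted_gaps(scores):
    """Return the gap (higher minus lower) for each dimension, in points."""
    return {dim: round(high - low, 2) for dim, (high, low) in scores.items()}


gaps = quoted_gaps(quoted_scores)
widest = max(gaps, key=gaps.get)
narrowest = min(gaps, key=gaps.get)

for dim, gap in gaps.items():
    print(f"{dim}: {gap:.2f} points")
print(f"widest quoted gap: {widest}; narrowest: {narrowest}")
```

On these quoted figures, Information Extraction shows the widest gap (60.23 points) and Prediction Recognition the narrowest (10.84 points), which matches the ordering described in the abstract.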

Paper Structure

This paper contains 21 sections, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Comparison of numerical calculation questions in Fin-Eva [team2023FinEva] and BizFinBench. The Fin-Eva example presents a straightforward financial math problem, while the BizFinBench example requires multi-step reasoning: first analyzing the problem, then extracting and using the relevant data from a provided markdown-formatted table to compute the answer accurately. A Chinese version is included in the Appendix for clarity and ease of reference.
  • Figure 2: Distribution of tasks in BizFinBench across five key dimensions. The benchmark is structured around five dimensions, each focusing on a distinct capability of financial large language models. The figure also briefly illustrates the core focus of each dimension.
  • Figure 3: Workflow of BizFinBench dataset construction.
  • Figure 4: IteraJudge Pipeline.
  • Figure 5: The instructions utilized in the evaluation of the FDD dataset.
  • ...and 12 more figures