FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models' Knowledge and Reasoning

Shaoyu Dou, Yutian Shen, Mofan Chen, Zixuan Wang, Jiajie Xu, Qi Guo, Kailai Shao, Chao Chen, Haixiang Hu, Haibo Shi, Min Min, Liwen Zhang

TL;DR

FinEval-KR introduces a decoupled evaluation framework to quantify LLM knowledge and reasoning in finance, using Knowledge Score, Reasoning Score, and a Bloom-inspired Cognitive Score to diagnose higher-order cognitive capabilities. It adds an open-source Chinese financial reasoning dataset with multi-layer annotations, enabling root-cause analysis via a three-stage evaluation: unconstrained answering, knowledge-augmented answering, and error diagnosis. Experimental results show that higher-order cognitive abilities and reasoning quality drive accuracy, while knowledge application remains a bottleneck even for top models, and specialized financial LLMs underperform top general models. The framework supports targeted model improvement by distinguishing knowledge gaps from reasoning deficiencies and suggests a dual-path strategy that combines leveraging general-purpose models with domain-specific alignment and fine-tuning for financial reasoning.

Abstract

Large Language Models (LLMs) demonstrate significant potential but face challenges in complex financial reasoning tasks that require both domain knowledge and sophisticated reasoning. Current evaluation benchmarks often fall short because they do not decouple indicators of these capabilities from single-task performance, and they lack root-cause analysis of task failures. To address this, we introduce FinEval-KR, a novel evaluation framework that decouples and independently quantifies LLMs' knowledge and reasoning abilities through distinct knowledge score and reasoning score metrics. Inspired by cognitive science, we further propose a cognitive score based on Bloom's taxonomy to analyze capabilities in reasoning tasks across different cognitive levels. We also release a new open-source Chinese financial reasoning dataset covering 22 subfields to support reproducible research and further advances in financial reasoning. Our experimental results reveal that LLM reasoning ability and higher-order cognitive ability are the core factors influencing reasoning accuracy. We also find that even top models still face a bottleneck in knowledge application. Furthermore, our analysis shows that specialized financial LLMs generally lag behind the top general-purpose LLMs across multiple metrics.
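
The three-stage evaluation named in the TL;DR (unconstrained answering, knowledge-augmented answering, error diagnosis) can be pictured as a simple control flow. The sketch below is illustrative only and is not the authors' implementation: the `Sample` fields, the three callables, the exact-match comparison, and the outcome labels are all hypothetical, and the rule "a failure that is fixed by supplying knowledge counts as a knowledge gap" is an assumption made for this example. It does not compute the paper's knowledge, reasoning, or cognitive scores.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    # Hypothetical fields; the real dataset schema may differ.
    question: str
    reference_answer: str
    knowledge_points: List[str]


def evaluate_sample(
    sample: Sample,
    answer: Callable[[str], str],
    answer_with_knowledge: Callable[[str, List[str]], str],
    diagnose: Callable[[Sample, str], str],
) -> str:
    """Route one sample through three stages and return an outcome label."""
    # Stage 1: unconstrained answering (no extra knowledge supplied).
    first = answer(sample.question)
    if first == sample.reference_answer:
        return "correct"

    # Stage 2: knowledge-augmented answering (annotated knowledge supplied).
    second = answer_with_knowledge(sample.question, sample.knowledge_points)
    if second == sample.reference_answer:
        # Assumption: supplying knowledge fixed the failure, so treat it
        # as a knowledge gap rather than a reasoning deficiency.
        return "knowledge_gap"

    # Stage 3: error diagnosis, delegated to a judge callable.
    return diagnose(sample, second)
```

Passing the model calls in as plain callables keeps the sketch independent of any particular LLM client; in practice the comparison with the reference answer would likely need something more tolerant than string equality.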

Paper Structure

This paper contains 66 sections, 4 equations, 20 figures, 7 tables.

Figures (20)

  • Figure 1: Three-stage evaluation framework of FinEval-KR and an example sample from the dataset. The original dataset is in Chinese; the figure provides an English translation for readability.
  • Figure 2: Example of review result generated by the judge model (original in Chinese, with English translation).
  • Figure 3: The prompt for experiment 1 and an exemplary sample (original in Chinese, with English translation).
  • Figure 4: The prompt for experiment 2 and an exemplary sample (original in Chinese, with English translation).
  • Figure 5: The prompt for experiment 3 and an exemplary sample (original in Chinese, with English translation).
  • ...and 15 more figures