FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

Zhuohan Xie, Daniil Orel, Rushil Thareja, Dhruv Sahnan, Hachem Madmoun, Fan Zhang, Debopriyo Banerjee, Georgi Georgiev, Xueqing Peng, Lingfei Qian, Jimin Huang, Jinyan Su, Aaryamonvikram Singh, Rui Xing, Rania Elbadry, Chen Xu, Haonan Li, Fajri Koto, Ivan Koychev, Tanmoy Chakraborty, Yuxia Wang, Salem Lahlou, Veselin Stoyanov, Sophia Ananiadou, Preslav Nakov

TL;DR

FinChain is a symbolic benchmark for machine-verifiable chain-of-thought financial reasoning, spanning 58 topics across 12 domains, with five parameterized templates per topic and executable Python traces. It pairs the dataset with ChainEval, a dynamic alignment metric that jointly evaluates final-answer correctness and step-level faithfulness through semantic and numeric matching within a Dynamic Time Warping framework. A large-scale evaluation of 26 LLMs shows that frontier models achieve the highest ChainEval scores yet still struggle with long-horizon symbolic reasoning, while domain-adapted and math-enhanced open models close part of the gap through targeted supervision. The work provides a rigorous, contamination-free platform for developing interpretable, verifiable financial AI and outlines future directions toward multilingual, real-document, and regulation-aware reasoning.
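
As a rough illustration of the alignment idea mentioned above, the sketch below aligns the intermediate values of a predicted reasoning chain with those of a reference chain using Dynamic Time Warping. It is not the authors' ChainEval implementation: the function names (`step_similarity`, `dtw_cost`), the 1% relative tolerance, and the use of a purely numeric step match are all assumptions made for illustration, and the semantic-matching component is omitted.

```python
import math
from typing import List


def step_similarity(pred: float, ref: float, rel_tol: float = 0.01) -> float:
    """Hypothetical numeric match: 1.0 if the values agree within rel_tol, else 0.0."""
    scale = max(abs(ref), 1e-12)  # guard against division by zero
    return 1.0 if abs(pred - ref) / scale <= rel_tol else 0.0


def dtw_cost(pred_steps: List[float], ref_steps: List[float]) -> float:
    """Cumulative DTW cost between two step sequences (lower is better).

    The local cost of pairing two steps is 1 - step_similarity, so a
    cost of 0.0 means every aligned pair of steps matched.
    """
    n, m = len(pred_steps), len(ref_steps)
    if n == 0 or m == 0:
        raise ValueError("both step sequences must be non-empty")

    # D[i][j] = minimal cost of aligning the first i predicted steps
    # with the first j reference steps.
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 1.0 - step_similarity(pred_steps[i - 1], ref_steps[j - 1])
            D[i][j] = local + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


# Illustrative chains: 1000 grows at 5% per year for two years.
reference = [1000.0, 1050.0, 1102.5]
correct_chain = [1000.0, 1050.0, 1102.5]
flawed_chain = [1000.0, 1200.0, 1102.5]  # wrong intermediate value
```

Under these assumptions, `dtw_cost(correct_chain, reference)` is 0.0, while the flawed chain incurs a cost of 1.0 for its single mismatched step. Because DTW permits one step to align with several, a chain that merely restates a step is not penalized in this sketch.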

Abstract

Multi-step symbolic reasoning is essential for robust financial analysis, yet current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought (CoT) evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python traces that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose ChainEval, a dynamic alignment metric that jointly evaluates final-answer correctness and step-level reasoning consistency. Evaluating 26 leading LLMs reveals that even frontier proprietary systems exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI.
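
To make the notion of a parameterized symbolic template with an executable trace more concrete, here is a minimal sketch for a compound interest problem. The names (`Instance`, `compound_interest_template`), the parameter ranges, and the wording of the question and steps are hypothetical and are not taken from the FinChain code base. The sketch only shows how a seeded generator can produce a fresh question together with a reasoning trace whose every value is computed rather than written by hand.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    question: str
    steps: List[str]   # human-readable reasoning trace
    answer: float      # final answer, rounded to cents


def compound_interest_template(seed: int) -> Instance:
    """Hypothetical template: sample parameters, then compute each step."""
    rng = random.Random(seed)  # seeded, so the instance is reproducible
    principal = rng.randrange(1000, 10001, 500)
    rate_pct = rng.randint(2, 10)
    years = rng.randint(1, 5)

    # Every intermediate value is computed, so the trace is checkable.
    growth = (1 + rate_pct / 100) ** years
    amount = principal * growth
    interest = amount - principal

    question = (
        f"An investor deposits ${principal} at {rate_pct}% annual interest, "
        f"compounded yearly. How much interest is earned after {years} years?"
    )
    steps = [
        f"Growth factor: (1 + {rate_pct}/100)^{years} = {growth:.4f}",
        f"Final amount: {principal} * {growth:.4f} = {amount:.2f}",
        f"Interest earned: {amount:.2f} - {principal} = {interest:.2f}",
    ]
    return Instance(question, steps, round(interest, 2))


example = compound_interest_template(seed=0)
```

Because the parameters are drawn from a seeded generator, new instances can be created on demand by changing the seed, which is one way a benchmark can avoid relying on a fixed, potentially memorized set of questions.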

Paper Structure

This paper contains 50 sections, 8 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Symbolic template for generating compound interest problems in FinChain.
  • Figure 2: FinChain taxonomy of financial reasoning topics. Our benchmark spans 58 topics organized into 12 major domains, ranging from traditional areas like Corporate Finance and Financial Reporting to emerging fields such as Crypto Finance and Sustainable Finance. This hierarchical structure enables fine-grained evaluation of symbolic reasoning across diverse financial domains.
  • Figure 3: Domain-level performance across financial domains. Radar plot comparing the best-performing model from each category in the overall model performance table, evaluated across twelve financial domains using ChainEval scores.
  • Figure 4: ChainEval score across difficulty levels.
  • Figure 5: Reference examples for compound interest templates, illustrating typical annotation cases with error tags, flawed solutions, and minimal fixes.
  • ...and 1 more figure