FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
Zhuohan Xie, Daniil Orel, Rushil Thareja, Dhruv Sahnan, Hachem Madmoun, Fan Zhang, Debopriyo Banerjee, Georgi Georgiev, Xueqing Peng, Lingfei Qian, Jimin Huang, Jinyan Su, Aaryamonvikram Singh, Rui Xing, Rania Elbadry, Chen Xu, Haonan Li, Fajri Koto, Ivan Koychev, Tanmoy Chakraborty, Yuxia Wang, Salem Lahlou, Veselin Stoyanov, Sophia Ananiadou, Preslav Nakov
TL;DR
FinChain is a symbolic, machine-verifiable benchmark for chain-of-thought financial reasoning, spanning 58 topics across 12 domains, with five parameterized templates per topic and executable Python traces. It pairs the dataset with ChainEval, a dynamic alignment metric that jointly evaluates final-answer correctness and step-level faithfulness through semantic and numeric matching within a Dynamic Time Warping framework. A large-scale evaluation of 26 LLMs shows that frontier models achieve the highest ChainEval scores yet still struggle with long-horizon symbolic reasoning, while domain-adapted and math-enhanced open models close part of the gap through targeted supervision. The work provides a rigorous, contamination-free platform for developing interpretable, verifiable financial AI and outlines future directions toward multilingual, real-document, and regulation-aware reasoning.
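To make the alignment idea concrete, the following is a minimal sketch of Dynamic Time Warping over reasoning steps. It is not the ChainEval implementation: the function names `step_cost` and `dtw_distance`, the token-overlap similarity (a stand-in for semantic matching), and the 1% relative tolerance are all assumptions chosen for illustration.

```python
import re

def step_cost(pred, gold, rel_tol=0.01):
    """Hypothetical per-step cost in [0, 1]; 0 means text and number both match."""
    (pred_text, pred_val), (gold_text, gold_val) = pred, gold
    # Assumption: Jaccard overlap of word tokens stands in for semantic similarity.
    a = set(re.findall(r"[a-z]+", pred_text.lower()))
    b = set(re.findall(r"[a-z]+", gold_text.lower()))
    text_sim = len(a & b) / len(a | b) if a | b else 0.0
    # Assumption: numbers match when within a relative tolerance of the gold value.
    num_ok = abs(pred_val - gold_val) <= rel_tol * max(abs(gold_val), 1e-9)
    return 1.0 - text_sim * (1.0 if num_ok else 0.0)

def dtw_distance(pred_steps, gold_steps):
    """Classic DTW: minimum cumulative cost of a monotonic step alignment."""
    n, m = len(pred_steps), len(gold_steps)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = step_cost(pred_steps[i - 1], gold_steps[j - 1])
            # Extend the cheapest of: skip a predicted step, skip a gold step, or match.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

# Illustrative data only; each step is a (description, numeric value) pair.
gold = [("growth factor", 1.05), ("final balance", 1102.5)]
wrong = [("growth factor", 1.05), ("final balance", 999.0)]
same_distance = dtw_distance(gold, gold)
wrong_distance = dtw_distance(wrong, gold)
```

In this sketch a predicted trace that reproduces the gold steps has zero distance, while a trace with a wrong intermediate number accumulates cost at the mismatched step.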
Abstract
Multi-step symbolic reasoning is essential for robust financial analysis, yet current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought (CoT) evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python traces that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose ChainEval, a dynamic alignment metric that jointly evaluates final-answer correctness and step-level reasoning consistency. Evaluating 26 leading LLMs reveals that even frontier proprietary systems exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI.
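As a rough illustration of a parameterized symbolic template with an executable trace, consider the sketch below. The template, its name `compound_interest_template`, the parameter ranges, and the output layout are hypothetical and are not taken from the FinChain release. The sketch only shows how sampling parameters from a seed can yield a fresh question together with intermediate steps that can be recomputed and checked.

```python
import random

def compound_interest_template(seed):
    """Hypothetical template: sample parameters, then emit a question,
    an executable step trace, and the final answer."""
    rng = random.Random(seed)  # seeding makes each instance reproducible
    principal = rng.randrange(1000, 10001, 500)
    rate_pct = rng.choice([2, 3, 4, 5, 6])
    years = rng.randint(2, 5)

    question = (
        f"An investor deposits ${principal} at {rate_pct}% annual interest, "
        f"compounded yearly. What is the balance after {years} years?"
    )

    # Each step records a description and the value computed at that step.
    steps = []
    growth = 1 + rate_pct / 100
    steps.append(("annual growth factor", growth))
    factor = growth ** years
    steps.append(("cumulative growth factor", factor))
    balance = round(principal * factor, 2)
    steps.append(("final balance", balance))

    return {"question": question, "steps": steps, "answer": balance}

instance = compound_interest_template(seed=7)
repeat = compound_interest_template(seed=7)
```

Because every value in the trace is computed rather than written by hand, a different seed produces a new instance with a known, verifiable solution, which is the property that makes this style of generation resistant to contamination.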
