FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking
Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Charese H. Smiley
TL;DR
FinNLI is a large, multi-genre Financial Natural Language Inference benchmark designed to probe domain-specific reasoning in finance. The dataset construction combines premise sampling from real financial documents, dual-LLM hypothesis generation, Z-filtering to minimize spurious cues, and finance-expert annotation, yielding 21,304 premise-hypothesis pairs, including a high-quality, expert-annotated test set of 3,304 instances. Empirical results show a substantial domain-shift penalty for general-domain NLI models. The best LLM baseline (78.62% Macro F1) surpasses the best PLM baseline (74.57%), yet instruction-tuned financial LLMs perform poorly, highlighting gaps in current financial reasoning capabilities. The dataset is therefore useful for evaluating domain adaptation and for guiding model improvements in financial NLI on real-world financial texts.
Abstract
We introduce FinNLI, a benchmark dataset for Financial Natural Language Inference (FinNLI) across diverse financial texts such as SEC Filings, Annual Reports, and Earnings Call transcripts. Our dataset framework ensures diverse premise-hypothesis pairs while minimizing spurious correlations. FinNLI comprises 21,304 pairs, including a high-quality test set of 3,304 instances annotated by finance experts. Evaluations show that domain shift significantly degrades general-domain NLI performance. The highest Macro F1 scores for the pre-trained language model (PLM) and large language model (LLM) baselines are 74.57% and 78.62%, respectively, highlighting the dataset's difficulty. Surprisingly, instruction-tuned financial LLMs perform poorly, suggesting limited generalizability. FinNLI exposes weaknesses in current LLMs for financial reasoning, indicating room for improvement.
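For readers unfamiliar with the reported metric, the following is a minimal sketch of how Macro F1 is computed for a three-way NLI task: per-class F1 scores are averaged with equal weight, so a rare class counts as much as a frequent one. The label names and the `macro_f1` helper are illustrative assumptions, not taken from the FinNLI release, and the toy predictions are made up for the example.

```python
from collections import Counter

# Assumed label set: the standard three NLI classes (not confirmed by the abstract).
LABELS = ("entailment", "neutral", "contradiction")


def macro_f1(gold, pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # predicted p, but it was wrong
            fn[g] += 1   # missed an instance of class g

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical data): one entailment pair is mislabeled as neutral.
gold = ["entailment", "entailment", "neutral", "contradiction"]
pred = ["entailment", "neutral", "neutral", "contradiction"]
print(round(macro_f1(gold, pred), 4))  # 0.7778
```

In this toy case the per-class F1 scores are 2/3, 2/3, and 1, giving a Macro F1 of 7/9. The same averaging underlies the 74.57% and 78.62% figures above, although the authors' exact evaluation code is not shown here.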
