FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking

Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Charese H. Smiley

TL;DR

FinNLI introduces a large, multi-genre Financial Natural Language Inference benchmark designed to probe domain-specific reasoning in finance. The dataset construction combines premise sampling from real financial documents, hypothesis generation with multiple LLMs, Z-filtering to minimize spurious cues, and finance-expert annotation, resulting in 21,304 premise-hypothesis pairs, including a high-quality 3,304-instance test set. Empirical results show a substantial domain-shift penalty for general-domain NLI models. The best LLM baseline surpasses the best fine-tuned PLM (78.62% vs. 74.57% Macro F1), yet instruction-tuned financial LLMs perform poorly, highlighting gaps in current financial reasoning capabilities. The work demonstrates the dataset’s utility for evaluating domain adaptation and informing future model improvements in financial NLI and reasoning tasks with real-world financial texts.

Abstract

We introduce FinNLI, a benchmark dataset for Financial Natural Language Inference (FinNLI) across diverse financial texts such as SEC Filings, Annual Reports, and Earnings Call transcripts. Our dataset framework ensures diverse premise-hypothesis pairs while minimizing spurious correlations. FinNLI comprises 21,304 pairs, including a high-quality test set of 3,304 instances annotated by finance experts. Evaluations show that domain shift significantly degrades general-domain NLI performance. The highest Macro F1 scores for pre-trained language model (PLM) and large language model (LLM) baselines are 74.57% and 78.62%, respectively, highlighting the dataset's difficulty. Surprisingly, instruction-tuned financial LLMs perform poorly, suggesting limited generalizability. FinNLI exposes weaknesses in current LLMs for financial reasoning, indicating room for improvement.
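
The Macro F1 metric reported above is the unweighted mean of per-class F1 scores, so each NLI label counts equally regardless of how often it occurs. A minimal sketch of that computation follows. The label names and the `macro_f1` helper are illustrative assumptions, not taken from the paper's code.

```python
from typing import Sequence

# Assumed label set for three-way NLI; the paper's exact label strings may differ.
LABELS = ("entailment", "neutral", "contradiction")


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Return the unweighted mean of per-class F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    gold = ["entailment", "entailment", "neutral", "contradiction"]
    pred = ["entailment", "neutral", "neutral", "contradiction"]
    print(f"Macro F1: {macro_f1(gold, pred):.4f}")
```

Because every class contributes equally, a model that ignores one label is penalized even if that label is rare in the test set.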

Paper Structure

This paper contains 44 sections, 3 figures, 13 tables.

Figures (3)

  • Figure 1: An example of NLI in financial risk assessment, which might seem Neutral as there is no explicit mention of "financial pressure" in the premise.
  • Figure 2: Overview of the FinNLI data generation pipeline. (1) We sample premises from real-world financial documents across multiple genres. (2) Hypothesis-label pairs are generated using multiple LLMs. (3) Z-filtering (Wu et al., 2022) removes spurious correlations. (4) The prompt is refined based on feedback from a general-domain NLI model and expert curation. (5) Finally, instances correctly predicted and misclassified by the NLI model are reviewed by finance experts for gold label annotation.
  • Figure 3: Average macro F1 scores (%) for various LLMs across different prompting setups evaluated on the FinNLI test set. Prompts marked AG include the class label definitions from the annotation guidelines. The error bars represent the standard deviation across $3$ independent runs. The dotted grey line at $74.57\%$ marks the performance of RoBERTa-Large, the best-performing fine-tuned PLM on the FinNLI test set.
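
The Z-filtering step in Figure 2 targets tokens that correlate with a label more strongly than chance. One way to score such cues, sketched below, is a z-statistic that compares a token's observed label rate with the uniform rate expected if the token carried no label information. This is an assumption about how such a filter could work, not the paper's implementation. The function name, the whitespace tokenization, and the three-label setting are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def token_label_z_scores(examples, num_labels=3, min_count=1):
    """Score (token, label) pairs by how far the label rate deviates from chance.

    examples: iterable of (hypothesis_text, label) pairs.
    Returns a dict mapping (token, label) to a z-statistic under the null
    hypothesis that a token co-occurs with each label at rate 1 / num_labels.
    """
    p0 = 1.0 / num_labels
    token_counts = Counter()
    token_label_counts = defaultdict(Counter)
    for hypothesis, label in examples:
        # Count each token once per hypothesis (naive whitespace tokenization).
        for token in set(hypothesis.lower().split()):
            token_counts[token] += 1
            token_label_counts[token][label] += 1

    scores = {}
    for token, n in token_counts.items():
        if n < min_count:
            continue
        std_err = math.sqrt(p0 * (1 - p0) / n)
        for label, k in token_label_counts[token].items():
            scores[(token, label)] = (k / n - p0) / std_err
    return scores


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy = [("not", "contradiction")] * 9 + [
        ("revenue", "entailment"),
        ("revenue", "neutral"),
        ("revenue", "contradiction"),
    ]
    for key, z in sorted(token_label_z_scores(toy).items()):
        print(key, round(z, 3))
```

A filter built on these scores would drop or down-sample generated hypotheses containing tokens whose z-statistic exceeds a chosen threshold. The threshold itself would be a tuning choice that this sketch does not fix.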