FiNER-ORD: Financial Named Entity Recognition Open Research Dataset

Agam Shah, Abhinav Gullapalli, Ruchit Vithani, Michael Galarnyk, Sudheer Chava

TL;DR

FiNER-ORD introduces the first high-quality English financial NER Open Research Dataset, addressing the finance domain’s unique entity semantics with a manually annotated corpus derived from open financial news. The paper benchmarks multiple PLMs and zero-shot LLMs on FiNER-ORD, finding that fine-tuned PLMs (notably RoBERTa-based models) generally outperform zero-shot LLMs like GPT-4o across PER, LOC, and ORG entities. A key contribution is a critical comparison with the CRA and CoNLL datasets, demonstrating CRA’s labeling biases and FiNER-ORD’s more balanced, finance-focused distributions, alongside transfer-learning ablations that underscore the value of domain-specific data. The work also discusses ethical considerations, licensing, and the limitations of using Form 10-K filings, ultimately arguing that FiNER-ORD provides a solid benchmark for future finance-domain NER and related NLP tasks with broad practical impact in information extraction and knowledge graphs.

Abstract

Over the last two decades, the development of the CoNLL-2003 named entity recognition (NER) dataset has helped enhance the capabilities of deep learning and natural language processing (NLP). The finance domain, characterized by its unique semantic and lexical variations for the same entities, presents specific challenges to the NER task; thus, a domain-specific customized dataset is crucial for advancing research in this field. In our work, we develop the first high-quality English Financial NER Open Research Dataset (FiNER-ORD). We benchmark multiple pre-trained language models (PLMs) and large language models (LLMs) on FiNER-ORD. We believe FiNER-ORD will serve as a benchmark for future financial domain-specific NER and NLP tasks. Our dataset, models, and code are publicly available on GitHub and Hugging Face under the CC BY-NC 4.0 license.

Paper Structure (35 sections, 2 figures, 8 tables)

Figures (2)

  • Figure 1: Representative example of annotation in FiNER-ORD.
  • Figure 2: Screenshot of an article in FiNER-ORD manually annotated with the open-source Doccano annotation tool.