HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings

Rasmus Aavang, Giovanni Rizzi, Rasmus Bøggild, Alexandre Iolov, Mike Zhang, Johannes Bjerva

TL;DR

The paper tackles the challenge of extracting hierarchical, numerical KPIs from iXBRL-tagged SEC filings by introducing HiFi-KPI, a large-scale dataset with a 218,126-label taxonomy, ~1.8 million paragraphs, and ~5 million entities. It proposes a taxonomy-based granularity selection via recursive ascent of the presentation and calculation taxonomies, enabling KPI extraction at multiple levels of detail. The authors provide baselines for text classification, sequence labeling, and LLM-based structured extraction, and release HiFi-KPI Lite for efficient evaluation; they also introduce merged and company-specific taxonomies to aid generalization. The work demonstrates the feasibility and value of structuring iXBRL labels for downstream tasks, showing that granularity and annotation quality significantly influence performance, and highlights the potential for domain-tuned models and expert mappings to improve extraction accuracy in financial contexts.

Abstract

The U.S. Securities and Exchange Commission (SEC) requires that public companies file financial reports in which numbers are tagged using the machine-readable inline eXtensible Business Reporting Language (iXBRL) standard. However, the highly complex and granular taxonomy defined by iXBRL limits label transferability across domains. In this paper, we introduce the Hierarchical Financial Key Performance Indicator (HiFi-KPI) dataset, designed to facilitate numerical KPI extraction at specified levels of granularity from unstructured financial text. Our approach organizes a 218,126-label hierarchy using a taxonomy-based grouping method, investigating which taxonomy layer provides the most meaningful structure. HiFi-KPI comprises ~1.8M paragraphs and ~5M entities, each linked to a label in the iXBRL-specific calculation and presentation taxonomies. We provide baselines using encoder-based approaches and structured extraction using Large Language Models (LLMs). To simplify LLM inference and evaluation, we additionally release HiFi-KPI Lite, a manually curated subset with four expert-mapped labels. We publicly release all artifacts.
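
The taxonomy-based grouping can be pictured as climbing from a fine-grained label toward the root of a hierarchy, so that a leaf inherits a coarser ancestor's label. The following is a minimal sketch of that idea, not the authors' implementation: the `ascend` function, the `PARENT_OF` mapping, and the `toy:` label names are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional

# Hypothetical toy taxonomy: each label maps to its parent (None marks the root).
# These names are illustrative only and are not taken from the real iXBRL taxonomies.
PARENT_OF: Dict[str, Optional[str]] = {
    "toy:IncomeStatement": None,
    "toy:Revenues": "toy:IncomeStatement",
    "toy:ProductRevenue": "toy:Revenues",
    "toy:ServiceRevenue": "toy:Revenues",
}


def ascend(label: str, parent_of: Dict[str, Optional[str]], steps: int) -> str:
    """Recursively climb at most `steps` levels from `label`, stopping at the root."""
    parent = parent_of.get(label)
    if steps <= 0 or parent is None:
        return label
    return ascend(parent, parent_of, steps - 1)


# Both fine-grained labels collapse onto the same coarser label after one step.
coarse_product = ascend("toy:ProductRevenue", PARENT_OF, 1)
coarse_service = ascend("toy:ServiceRevenue", PARENT_OF, 1)
```

Under this sketch, choosing a larger `steps` value yields fewer, coarser labels, which is one way to think about selecting a level of granularity.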

Paper Structure

This paper contains 33 sections, 1 equation, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Granularity Selection. Example snippet where arrows mark corresponding positions in a tree-map of the two taxonomies. The top-left figure illustrates our recursive approach to ascending the hierarchy, where leaves inherit their parents' labels.
  • Figure 2: Results HiFi-KPI. We compare the aggregate average for macro F$_1$ for the presentation layer (left), the calculation layer (middle), and the sequence labeling experiment (right).
  • Figure 3: Tree map from us-gaap:RevenuesAbstract and down
  • Figure 4: System Prompt