HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings
Rasmus Aavang, Giovanni Rizzi, Rasmus Bøggild, Alexandre Iolov, Mike Zhang, Johannes Bjerva
TL;DR
The paper addresses the extraction of hierarchical, numerical KPIs from iXBRL-tagged SEC filings by introducing HiFi-KPI, a large-scale dataset with a 218,126-label taxonomy, ~1.8 million paragraphs, and ~5 million entities. It proposes taxonomy-based granularity selection via recursive ascent of the presentation and calculation taxonomies, which enables KPI extraction at multiple levels of detail. The authors provide baselines for text classification, sequence labeling, and LLM-based structured extraction, and release HiFi-KPI Lite for efficient evaluation. They also introduce merged and company-specific taxonomies to aid generalization. The work shows that structuring iXBRL labels is feasible and useful for downstream tasks, that granularity and annotation quality significantly influence performance, and that domain-tuned models and expert mappings have the potential to improve extraction accuracy in financial contexts.
Abstract
The U.S. Securities and Exchange Commission (SEC) requires that public companies file financial reports in which numbers are tagged with the machine-readable inline eXtensible Business Reporting Language (iXBRL) standard. However, the highly complex and granular taxonomy defined by iXBRL limits label transferability across domains. In this paper, we introduce the Hierarchical Financial Key Performance Indicator (HiFi-KPI) dataset, designed to facilitate numerical KPI extraction at specified levels of granularity from unstructured financial text. Our approach organizes a 218,126-label hierarchy using a taxonomy-based grouping method, investigating which taxonomy layer provides the most meaningful structure. HiFi-KPI comprises ~1.8M paragraphs and ~5M entities, each linked to a label in the iXBRL-specific calculation and presentation taxonomies. We provide baselines using encoder-based approaches and structured extraction using Large Language Models (LLMs). To simplify LLM inference and evaluation, we additionally release HiFi-KPI Lite, a manually curated subset with four expert-mapped labels. We publicly release all artifacts.
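As a rough illustration of taxonomy-based grouping, the sketch below coarsens a fine-grained label by climbing a fixed number of steps toward the root of a child-to-parent map. It is a minimal sketch under stated assumptions, not the authors' implementation: the `PARENT` map, its label names, and the `ascend` and `coarsen` helpers are hypothetical, and a real iXBRL presentation or calculation taxonomy may give a label several parents, which this toy version ignores.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (not taken from the dataset):
# child label -> parent label, with None marking a root.
PARENT: Dict[str, Optional[str]] = {
    "ProductRevenue": "Revenues",
    "ServiceRevenue": "Revenues",
    "Revenues": "IncomeStatement",
    "InterestExpense": "OperatingExpenses",
    "OperatingExpenses": "IncomeStatement",
    "IncomeStatement": None,
}


def ascend(label: str, parent: Dict[str, Optional[str]], levels: int) -> str:
    """Climb at most `levels` steps toward the root.

    Stops early at a root or at a label that is missing from the map,
    so unknown labels are returned unchanged.
    """
    current = label
    for _ in range(levels):
        nxt = parent.get(current)
        if nxt is None:
            break
        current = nxt
    return current


def coarsen(labels: List[str], parent: Dict[str, Optional[str]], levels: int) -> List[str]:
    """Map every fine-grained label to its ancestor `levels` steps up."""
    return [ascend(label, parent, levels) for label in labels]


if __name__ == "__main__":
    fine = ["ProductRevenue", "ServiceRevenue", "InterestExpense"]
    print(coarsen(fine, PARENT, 1))
```

Raising `levels` merges more fine-grained labels into a shared ancestor, which shrinks the label space at the cost of detail; this is the kind of trade-off the granularity selection is meant to expose.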
