LLM-based Triplet Extraction from Financial Reports
Dante Wesslund, Ville Stenström, Pontus Linde, Alexander Holmberg
TL;DR
This work addresses the extraction of structured knowledge, in the form of Subject-Predicate-Object triplets, from financial reports for which no annotated ground truth exists. It proposes an ontology-driven extraction pipeline evaluated along two proxy axes, Ontology Conformance and Faithfulness, and compares a static, manually engineered ontology with automatically induced, document-specific ontologies. A hybrid verification strategy that combines strict regex matching with an LLM-as-a-judge check sharply reduces apparent hallucinations, with the subject hallucination rate falling from 65.2% to 1.6%. The document-specific ontologies reach 100% schema conformance in all configurations and eliminate the ontology drift observed with the manual ontology. The approach enables ground-truth-free evaluation and robust Knowledge Graph construction from corporate disclosures, with practical implications for scalable Knowledge Graph creation in the financial domain.
Abstract
Corporate financial reports are a valuable source of structured knowledge for Knowledge Graph construction, but the lack of annotated ground truth in this domain makes evaluation difficult. We present a semi-automated pipeline for Subject-Predicate-Object triplet extraction that uses ontology-driven proxy metrics, specifically Ontology Conformance and Faithfulness, instead of ground-truth-based evaluation. We compare a static, manually engineered ontology against a fully automated, document-specific ontology induction approach across different LLMs and two corporate annual reports. The automatically induced ontology achieves 100% schema conformance in all configurations, eliminating the ontology drift observed with the manual approach. We also propose a hybrid verification strategy that combines regex matching with an LLM-as-a-judge check, reducing apparent subject hallucination rates from 65.2% to 1.6% by filtering false positives caused by coreference resolution. Finally, we identify a systematic asymmetry between subject and object hallucinations, which we attribute to passive constructions and omitted agents in financial prose.
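The hybrid verification idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `strict_match`, `is_supported`, `hallucination_rate`, and `stub_judge`, the whole-word, case-insensitive matching rule, and the example text are all assumptions made for illustration. In particular, `stub_judge` is a hard-coded stand-in for a real LLM-as-a-judge call.

```python
import re
from typing import Callable, Sequence

# Hypothetical judge signature: (entity, source_text) -> is the entity supported?
Judge = Callable[[str, str], bool]


def strict_match(entity: str, source: str) -> bool:
    """Regex stage: the entity must occur verbatim in the source as a whole
    word or phrase (case-insensitive). The exact rule is an assumption."""
    pattern = r"(?<!\w)" + re.escape(entity.strip()) + r"(?!\w)"
    return re.search(pattern, source, flags=re.IGNORECASE) is not None


def is_supported(entity: str, source: str, judge: Judge) -> bool:
    """Hybrid check: accept on a strict match, otherwise defer to the judge."""
    if strict_match(entity, source):
        return True
    return judge(entity, source)


def hallucination_rate(entities: Sequence[str], source: str, judge: Judge) -> float:
    """Fraction of entities that neither stage could ground in the source."""
    if not entities:
        return 0.0
    unsupported = sum(1 for e in entities if not is_supported(e, source, judge))
    return unsupported / len(entities)


def stub_judge(entity: str, source: str) -> bool:
    """Stand-in for an LLM call (assumption): treats "Acme Group" as a valid
    coreference-resolved form of "Acme AB (the Group)"."""
    return entity in {"Acme Group"}


def no_judge(entity: str, source: str) -> bool:
    """Disables the second stage, giving the regex-only baseline."""
    return False


# Invented example text and extracted subjects.
source_text = "Acme AB (the Group) reported higher revenue. The Group acquired Beta Oy."
subjects = ["Acme AB", "Acme Group", "Gamma Inc"]

regex_only_rate = hallucination_rate(subjects, source_text, no_judge)
hybrid_rate = hallucination_rate(subjects, source_text, stub_judge)

if __name__ == "__main__":
    print(f"regex only: {regex_only_rate:.2f}, hybrid: {hybrid_rate:.2f}")
```

In this toy case the regex stage alone flags "Acme Group" as hallucinated because the string never appears verbatim, even though it is a reasonable resolution of "the Group". The judge stage recovers it, while "Gamma Inc" remains unsupported under both checks. Running the cheap regex check first also means the judge is only consulted for entities that fail it.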
