LLM-based Triplet Extraction from Financial Reports

Dante Wesslund, Ville Stenström, Pontus Linde, Alexander Holmberg

TL;DR

This work addresses the extraction of structured knowledge, in the form of Subject-Predicate-Object triplets, from financial reports for which no annotated ground truth exists. It proposes an ontology-driven extraction pipeline evaluated along two proxy axes, Ontology Conformance and Faithfulness, and compares a static, manually engineered ontology with an automatically induced, document-specific one. A hybrid verification strategy that combines strict regex matching with an LLM-as-a-judge check sharply reduces apparent subject and object hallucinations (the apparent subject hallucination rate falls from 65.2% to 1.6%), while the document-specific ontologies reach 100% schema conformance in all configurations and eliminate the ontology drift seen with the manual ontology. The approach enables ground-truth-free evaluation of KG construction from corporate disclosures, with practical implications for scalable knowledge graph creation in the financial domain.

Abstract

Corporate financial reports are a valuable source of structured knowledge for Knowledge Graph construction, but the lack of annotated ground truth in this domain makes evaluation difficult. We present a semi-automated pipeline for Subject-Predicate-Object triplet extraction that uses ontology-driven proxy metrics, specifically Ontology Conformance and Faithfulness, instead of ground-truth-based evaluation. We compare a static, manually engineered ontology against a fully automated, document-specific ontology induction approach across different LLMs and two corporate annual reports. The automatically induced ontology achieves 100% schema conformance in all configurations, eliminating the ontology drift observed with the manual approach. We also propose a hybrid verification strategy that combines regex matching with an LLM-as-a-judge check, reducing apparent subject hallucination rates from 65.2% to 1.6% by filtering false positives caused by coreference resolution. Finally, we identify a systematic asymmetry between subject and object hallucinations, which we attribute to passive constructions and omitted agents in financial prose.
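
To make the two proxy metrics more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that Ontology Conformance is the share of triplets whose predicate belongs to the ontology, and that Faithfulness is checked per entity by a strict regex match first, with an LLM-as-a-judge fallback only when the regex fails. The names `Triplet`, `ontology_conformance`, `regex_grounded`, `hybrid_grounded`, and `hallucination_rate` are hypothetical, and the judge is passed in as a plain callable so that no particular LLM API is assumed.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Set


@dataclass(frozen=True)
class Triplet:
    subject: str
    predicate: str
    obj: str


def ontology_conformance(triplets: Iterable[Triplet], allowed: Set[str]) -> float:
    """Share of triplets whose predicate is in the ontology (assumed definition)."""
    items = list(triplets)
    if not items:
        return 0.0
    return sum(t.predicate in allowed for t in items) / len(items)


def regex_grounded(entity: str, source: str) -> bool:
    """Strict check: the entity occurs verbatim (case-insensitive) as a whole token."""
    entity = entity.strip()
    if not entity:
        return False
    pattern = r"(?<!\w)" + re.escape(entity) + r"(?!\w)"
    return re.search(pattern, source, flags=re.IGNORECASE) is not None


def hybrid_grounded(entity: str, source: str,
                    judge: Callable[[str, str], bool]) -> bool:
    """Regex first; ask the judge only when the strict match fails.

    `judge` stands in for an LLM-as-a-judge call (hypothetical interface):
    it should return True when the source supports the entity, for
    example through coreference such as "the Group".
    """
    return regex_grounded(entity, source) or judge(entity, source)


def hallucination_rate(entities: Iterable[str],
                       is_grounded: Callable[[str], bool]) -> float:
    """Share of entities that the given check cannot ground in the source."""
    items = list(entities)
    if not items:
        return 0.0
    return sum(not is_grounded(e) for e in items) / len(items)
```

With an invented source such as "Acme AB published its annual report. The Group increased net sales.", the subject "Acme Group" fails the strict regex check although the text refers to it as "The Group". This is the kind of coreference-induced false positive that the judge step is meant to filter out.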

Paper Structure (27 sections, 1 equation, 1 figure, 2 tables)

Figures (1)

  • Figure 1: Comparison of ontology strategies. (a) Manual strategy: a single static ontology is derived from the Volvo report and applied to both documents, testing domain generalization. (b) Automatic strategy: an LLM induces a unique ontology for each document before extraction, testing document-specific adaptation.