Neurosymbolic Information Extraction from Transactional Documents

Arthur Hemmer, Mickaël Coustaty, Nicola Bartolo, Jean-Marc Ogier

TL;DR

This work tackles information extraction from transactional documents by marrying neural generation with symbolic, schema-driven validation. A neurosymbolic pipeline uses LLMs to produce candidate extractions that are then filtered through syntactic, task, and domain-level checks, enforcing domain arithmetic integrity. The authors introduce a comprehensive 53-field transactional schema and relabeled datasets (CORD_TD and SROIE_TD) to support high-quality label generation for knowledge distillation. Results show that multi-layer validation improves $F_1$-scores and accuracy, especially in zero-shot and distillation settings, and the relabeled data provide a valuable benchmark for future research in structured IE on documents.

Abstract

This paper presents a neurosymbolic framework for information extraction from documents, evaluated on transactional documents. We introduce a schema-based approach that integrates symbolic validation methods to enable more effective zero-shot output and knowledge distillation. The methodology uses language models to generate candidate extractions, which are then filtered through syntactic-, task-, and domain-level validation to ensure adherence to domain-specific arithmetic constraints. Our contributions include a comprehensive schema for transactional documents, relabeled datasets, and an approach for generating high-quality labels for knowledge distillation. Experimental results demonstrate significant improvements in $F_1$-scores and accuracy, highlighting the effectiveness of neurosymbolic validation in transactional document processing.

Paper Structure

This paper contains 26 sections, 5 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of the information extraction pipeline, in which a prompt, a document, and a schema are processed by an LLM. Outputs are filtered for syntax, task, and domain validity, so that only high-quality labels are retained.
  • Figure 2: Overview of the schema we introduce for domain-level validation of the information extracted from transactional documents.
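
The three filtering stages named in the abstract and in Figure 1 (syntax, task, and domain validity) can be illustrated with a minimal sketch. This is not the authors' implementation: the field names (`line_items`, `amount`, `subtotal`, `tax`, `total`), the rounding tolerance, and the specific checks at each level are assumptions chosen for illustration, since the 53-field schema and its exact constraints are not reproduced in this section.

```python
import json
from decimal import Decimal, InvalidOperation

# Hypothetical field names; the paper's actual schema is not shown here.
REQUIRED_FIELDS = {"line_items", "subtotal", "tax", "total"}
TOLERANCE = Decimal("0.01")  # assumed rounding tolerance


def to_decimal(value):
    """Convert a JSON value to a finite Decimal, or return None."""
    try:
        number = Decimal(str(value))
    except InvalidOperation:
        return None
    return number if number.is_finite() else None


def syntax_valid(raw):
    """Syntax level: the LLM output must parse as a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def task_valid(data):
    """Task level (assumed): required fields exist and amounts are numeric."""
    if not REQUIRED_FIELDS.issubset(data):
        return False
    items = data["line_items"]
    if not isinstance(items, list):
        return False
    amounts = [data[key] for key in ("subtotal", "tax", "total")]
    amounts += [i.get("amount") if isinstance(i, dict) else None for i in items]
    return all(to_decimal(a) is not None for a in amounts)


def domain_valid(data):
    """Domain level (assumed): the arithmetic between fields must hold."""
    items_sum = sum(
        (to_decimal(i["amount"]) for i in data["line_items"]), Decimal("0")
    )
    subtotal, tax, total = (
        to_decimal(data[key]) for key in ("subtotal", "tax", "total")
    )
    return (
        abs(items_sum - subtotal) <= TOLERANCE
        and abs(subtotal + tax - total) <= TOLERANCE
    )


def filter_candidates(raw_outputs):
    """Keep only candidate extractions that pass all three levels."""
    kept = []
    for raw in raw_outputs:
        data = syntax_valid(raw)
        if data is not None and task_valid(data) and domain_valid(data):
            kept.append(data)
    return kept
```

In a distillation setting, the candidates that survive `filter_candidates` would serve as labels for training a smaller model. The checks run from cheapest to most specific, so malformed outputs are discarded before any arithmetic is evaluated.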