Neurosymbolic Information Extraction from Transactional Documents
Arthur Hemmer, Mickaël Coustaty, Nicola Bartolo, Jean-Marc Ogier
TL;DR
This work tackles information extraction from transactional documents by combining neural generation with symbolic, schema-driven validation. A neurosymbolic pipeline uses LLMs to produce candidate extractions, which are then filtered through syntactic-, task-, and domain-level checks that enforce domain-specific arithmetic constraints. The authors introduce a comprehensive 53-field transactional schema and relabeled datasets (CORD_TD and SROIE_TD) to support high-quality label generation for knowledge distillation. Results show that multi-layer validation improves $F_1$-scores and accuracy, especially in zero-shot and distillation settings, and the relabeled datasets provide a benchmark for future research on structured information extraction from documents.
Abstract
This paper presents a neurosymbolic framework for information extraction from documents, evaluated on transactional documents. We introduce a schema-based approach that integrates symbolic validation methods to enable more effective zero-shot output and knowledge distillation. The methodology uses language models to generate candidate extractions, which are then filtered through syntactic-, task-, and domain-level validation to ensure adherence to domain-specific arithmetic constraints. Our contributions include a comprehensive schema for transactional documents, relabeled datasets, and an approach for generating high-quality labels for knowledge distillation. Experimental results demonstrate significant improvements in $F_1$-scores and accuracy, highlighting the effectiveness of neurosymbolic validation in transactional document processing.
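To make the three validation levels concrete, the following is a minimal sketch of how candidate extractions might be filtered, not the authors' implementation. The field names (`subtotal`, `tax`, `total`, `line_items`, `amount`), the helper functions, and the two arithmetic rules are hypothetical stand-ins for the paper's larger schema and constraint set, and the candidates are assumed to arrive as JSON strings produced by a language model.

```python
import json
from decimal import Decimal, InvalidOperation

# Hypothetical mini-schema; the schema described in the paper is much larger.
REQUIRED_FIELDS = ("subtotal", "tax", "total", "line_items")


def parse_syntactic(raw):
    """Syntactic level: the candidate must parse as a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def to_decimal(value):
    """Convert a value to a finite Decimal, or return None if it is not numeric."""
    try:
        number = Decimal(str(value))
    except InvalidOperation:
        return None
    return number if number.is_finite() else None


def check_task(data):
    """Task level: required fields exist and every amount is numeric."""
    if any(field not in data for field in REQUIRED_FIELDS):
        return None
    items = data["line_items"]
    if not isinstance(items, list) or not all(isinstance(i, dict) for i in items):
        return None
    amounts = {k: to_decimal(data[k]) for k in ("subtotal", "tax", "total")}
    item_amounts = [to_decimal(i.get("amount")) for i in items]
    if any(v is None for v in amounts.values()):
        return None
    if any(v is None for v in item_amounts):
        return None
    return amounts, item_amounts


def check_domain(amounts, item_amounts, tol=Decimal("0.01")):
    """Domain level: assumed arithmetic rules that a receipt should satisfy."""
    items_ok = abs(sum(item_amounts, Decimal(0)) - amounts["subtotal"]) <= tol
    total_ok = abs(amounts["subtotal"] + amounts["tax"] - amounts["total"]) <= tol
    return items_ok and total_ok


def filter_candidates(raw_candidates):
    """Keep only the candidates that pass all three validation levels."""
    accepted = []
    for raw in raw_candidates:
        data = parse_syntactic(raw)
        if data is None:
            continue
        parsed = check_task(data)
        if parsed is None:
            continue
        if check_domain(*parsed):
            accepted.append(data)
    return accepted


candidates = [
    '{"subtotal": "10.00", "tax": "1.00", "total": "11.00", '
    '"line_items": [{"amount": "4.00"}, {"amount": "6.00"}]}',
    '{"subtotal": "10.00", "tax": "1.00", "total": "12.00", '
    '"line_items": [{"amount": "10.00"}]}',  # fails the total rule
    '{"subtotal": 10',                        # not valid JSON
    '{"subtotal": "10.00", "total": "10.00"}',  # missing required fields
]
accepted = filter_candidates(candidates)
```

In this sketch only the first candidate survives, because it is the only one that is well-formed, complete, and arithmetically consistent. Under the same assumptions, candidates that pass all three levels could serve as labels for distillation, while rejected ones are simply discarded.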
