FabricQA-Extractor: A Question Answering System to Extract Information from Documents using Natural Language Questions
Qiming Wang, Raul Castro Fernandez
TL;DR
This work tackles the challenge of extracting structured data from large volumes of unstructured text by introducing Relation Coherence, a model that uses the relational structure of the target table to augment open-domain question answering. Relation Coherence is integrated into FabricQA-Extractor, an end-to-end system that processes millions of documents with sub-second latency by combining offline chunking and indexing, passage ranking, and answer ranking with a backward-search coherence mechanism. Evaluations on QA-ZRE and BioNLP show improvements over strong baselines and indicate that the approach works across domains with limited training data, highlighting the importance of relational consistency in information extraction. The framework offers a scalable, transparent way to populate missing table cells from large corpora, making large-scale data structuring practical for data management tasks.
Abstract
Reading comprehension models answer questions posed in natural language when provided with a short passage of text. They present an opportunity to address a long-standing challenge in data management: the extraction of structured data from unstructured text. Consequently, several approaches use these models to perform information extraction. However, these approaches miss an opportunity because they do not exploit the relational structure of the target extraction table. In this paper, we introduce a new model, Relation Coherence, that exploits knowledge of the relational structure to improve extraction quality. We incorporate the Relation Coherence model into FabricQA-Extractor, an end-to-end system we built from scratch to conduct large-scale extraction tasks over millions of documents. We demonstrate on two datasets with millions of passages that Relation Coherence boosts extraction performance, and we evaluate FabricQA-Extractor on large-scale datasets.
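To make the extraction setting concrete, the following minimal sketch shows one way a missing table cell could be turned into a natural language question for a reading comprehension model. It is an illustration under stated assumptions, not the system's implementation: the template dictionary, the `MissingCell` type, the `subject` key column, and both function names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

# Hypothetical question templates keyed by relation (column) name.
# These are illustrative assumptions, not templates from the paper.
TEMPLATES = {
    "date_of_birth": "When was {subject} born?",
    "place_of_birth": "Where was {subject} born?",
}


@dataclass
class MissingCell:
    subject: str   # value of the row's key column
    relation: str  # name of the column whose value is missing


def cell_to_question(cell: MissingCell) -> str:
    """Turn a missing table cell into a natural language question."""
    template = TEMPLATES.get(cell.relation)
    if template is None:
        # Generic fallback for relations without a hand-written template.
        readable = cell.relation.replace("_", " ")
        return f"What is the {readable} of {cell.subject}?"
    return template.format(subject=cell.subject)


def questions_for_table(rows, relation):
    """Yield (subject, question) for every row whose `relation` cell is empty."""
    for row in rows:
        if row.get(relation) is None:
            cell = MissingCell(row["subject"], relation)
            yield row["subject"], cell_to_question(cell)
```

In a full pipeline, each generated question would be sent to the passage and answer rankers, and the top-ranked answer would fill the corresponding cell. That downstream part is deliberately left out here, since its details are not given in this section.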
