FabricQA-Extractor: A Question Answering System to Extract Information from Documents using Natural Language Questions

Qiming Wang, Raul Castro Fernandez

TL;DR

This work tackles the challenge of extracting structured data from vast unstructured text by introducing Relation Coherence, a model that uses relational structure to augment open-domain question answering. Coupled with FabricQA-Extractor, it forms an end-to-end system that processes millions of documents with sub-second latency, combining offline chunking/indexing, passage ranking, and answer ranking with a backward-search coherence mechanism. Evaluations on QA-ZRE and BioNLP demonstrate improvements over strong baselines and show the approach works across domains with limited training data, highlighting the importance of relational consistency in information extraction. The proposed framework offers a scalable, transparent solution to populate missing table cells from large corpora, enabling practical large-scale data structuring for data management tasks.

Abstract

Reading comprehension models answer questions posed in natural language when provided with a short passage of text. They present an opportunity to address a long-standing challenge in data management: the extraction of structured data from unstructured text. Consequently, several approaches use these models to perform information extraction. However, these approaches miss an opportunity because they do not exploit the relational structure of the target extraction table. In this paper, we introduce a new model, Relation Coherence, that exploits knowledge of the relational structure to improve extraction quality. We incorporate the Relation Coherence model into FabricQA-Extractor, an end-to-end system we built from scratch to conduct large-scale extraction tasks over millions of documents. We demonstrate on two datasets with millions of passages that Relation Coherence boosts extraction performance, and we evaluate FabricQA-Extractor on large-scale datasets.

Paper Structure

This paper contains 30 sections, 7 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: FabricQA-Extractor user interface.
  • Figure 2: FabricQA-Extractor's architecture overview.
  • Figure 3: Schematic of the Answer ranker model.
  • Figure 4: FabricQA-Ensemble relative improvement over FabricQA (top). FabricQA-Extractor improvement is shown at the bottom.
  • Figure 5: DrQA-Coherence relative improvement over DrQA-Adapted.
  • ...and 1 more figures