Schema-Driven Information Extraction from Heterogeneous Tables

Fan Bai, Junmo Kang, Gabriel Stanovsky, Dayne Freitag, Mark Dredze, Alan Ritter

TL;DR

This paper introduces schema-driven information extraction, a new task that transforms tabular data into structured records following a human-authored schema, and presents a benchmark comprised of tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages.

Abstract

In this paper, we explore the question of whether large language models can support cost-efficient information extraction from tables. We introduce schema-driven information extraction, a new task that transforms tabular data into structured records following a human-authored schema. To assess the capabilities of various LLMs on this task, we present a benchmark comprised of tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages. We use this collection of annotated tables to evaluate the ability of open-source and API-based language models to extract information from tables covering diverse domains and data formats. Our experiments demonstrate that surprisingly competitive performance can be achieved without requiring task-specific pipelines or labels, with F1 scores ranging from 74.2 to 96.1, while maintaining cost efficiency. Moreover, through detailed ablation studies and analyses, we investigate the factors contributing to model success and validate the practicality of distilling compact models to reduce API reliance.
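To make the task definition concrete, the following is a minimal sketch of what "records following a human-authored schema" could look like in practice. The schema format (attribute name mapped to a Python type), the attribute names, the helper functions `conforms` and `parse_records`, and the sample values are all assumptions made for illustration; they are not the paper's actual schema or pipeline.

```python
import json

# Hypothetical extraction schema: attribute name -> expected data type.
SCHEMA = {"model": str, "dataset": str, "metric": str, "value": float}


def conforms(record, schema):
    """Return True if the record has exactly the schema's attributes with matching types."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[key], typ) for key, typ in schema.items())


def parse_records(lines, schema):
    """Parse one JSON object per line and keep only records that conform to the schema."""
    kept = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip output lines that are not valid JSON
        if isinstance(record, dict) and conforms(record, schema):
            kept.append(record)
    return kept


# Made-up model output: one valid record, one with a wrongly typed value, one non-JSON line.
model_output = [
    '{"model": "model-A", "dataset": "dataset-X", "metric": "F1", "value": 88.5}',
    '{"model": "model-A", "dataset": "dataset-X", "metric": "F1", "value": "88.5"}',
    "not json",
]
records = parse_records(model_output, SCHEMA)
```

In this sketch only the first line survives, because the second stores the value as a string and the third is not JSON. A stricter or looser notion of conformance (for example, accepting integers where floats are expected) is a design choice the sketch does not settle.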

Paper Structure (50 sections, 8 figures, 12 tables)

This paper contains 50 sections, 8 figures, and 12 tables.

Figures (8)

  • Figure 1: Overview of Schema-Driven Information Extraction. The input includes two elements: the source code of a table and a human-authored extraction schema, outlining the target attributes and their data types. The output consists of a sequence of JSON records that conform to the extraction schema.
  • Figure 2: Left: Prompt formulation of our proposed method InstrucTE. Right: Illustration of our error-recovery strategy, which ensures that the model complies with the instructed cell traversal order and reduces inference costs.
  • Figure 3: Capability of various LLMs to perform Schema-Driven IE, measured using the Schema-to-Json benchmark. We employ Table-F1 for our two newly annotated datasets and provide a measure of human performance. For DisCoMat [gupta2022discomat] and SWDE [swde], we adhere to their original evaluation metrics, i.e., Tuple-F1 and Page-F1 respectively, to support comparisons with established methods. In SWDE experiments, $k$ represents the number of trained websites from each vertical. Due to API cost constraints, *InstrucTE's results are computed on a 1,600 webpage sample, with bootstrap confidence intervals calculated to validate the reliability of these performance estimates (the margin of error for a 95% confidence interval with 1000 samples is 0.00995).
  • Figure 4: Ablation studies on various components of InstrucTE (w/ code-davinci-002) on the ML tables. Interestingly, excluding the table caption improves performance. Our detailed error analysis of captions in the appendix reveals that low-quality captions (e.g., those lacking specificity) may confuse the model, leading to inaccurate predictions.
  • Figure 5: Results of comparing various metrics, including token-level F1, SBERT, and BERTScore, to human judgment over different thresholds on ML tables. Numbers are computed over 677 sampled attributes that are paired with respective gold references.
  • ...and 3 more figures
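
The token-level F1 mentioned in the Figure 5 caption can be sketched as follows. This is the common bag-of-tokens formulation with lowercased whitespace tokenization, which is an assumption here; the paper's exact tokenization and matching rule may differ. The function names and the default threshold of 0.5 are hypothetical and chosen only for illustration.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two strings, using lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a perfect match; one empty string scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def is_match(prediction, reference, threshold=0.5):
    """Treat a predicted attribute as correct if its token-level F1 reaches the threshold."""
    return token_f1(prediction, reference) >= threshold
```

Sweeping `threshold` over a range of values and comparing `is_match` against human judgments is one way to read the "different thresholds" comparison the caption describes.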