Table of Contents
Fetching ...

ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models

Benjamin Newman, Yoonjoo Lee, Aakanksha Naik, Pao Siangliulue, Raymond Fok, Juho Kim, Daniel S. Weld, Joseph Chee Chang, Kyle Lo

TL;DR

A framework that leverages LMs to perform this task by decomposing it into separate schema and value generation steps is introduced, and it is found that even when LMs fail to fully reconstruct a reference table, their generated novel aspects can still be useful.

Abstract

When conducting literature reviews, scientists often create literature review tables - tables whose rows are publications and whose columns constitute a schema, a set of aspects used to compare and contrast the papers. Can we automatically generate these tables using language models (LMs)? In this work, we introduce a framework that leverages LMs to perform this task by decomposing it into separate schema and value generation steps. To enable experimentation, we address two main challenges: First, we overcome a lack of high-quality datasets to benchmark table generation by curating and releasing arxivDIGESTables, a new dataset of 2,228 literature review tables extracted from ArXiv papers that synthesize a total of 7,542 research papers. Second, to support scalable evaluation of model generations against human-authored reference tables, we develop DecontextEval, an automatic evaluation method that aligns elements of tables with the same underlying aspects despite differing surface forms. Given these tools, we evaluate LMs' abilities to reconstruct reference tables, finding this task benefits from additional context to ground the generation (e.g. table captions, in-text references). Finally, through a human evaluation study we find that even when LMs fail to fully reconstruct a reference table, their generated novel aspects can still be useful.

ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models

TL;DR

A framework that leverages LMs to perform this task by decomposing it into separate schema and value generation steps is introduced, and it is found that even when LMs fail to fully reconstruct a reference table, their generated novel aspects can still be useful.

Abstract

When conducting literature reviews, scientists often create literature review tables - tables whose rows are publications and whose columns constitute a schema, a set of aspects used to compare and contrast the papers. Can we automatically generate these tables using language models (LMs)? In this work, we introduce a framework that leverages LMs to perform this task by decomposing it into separate schema and value generation steps. To enable experimentation, we address two main challenges: First, we overcome a lack of high-quality datasets to benchmark table generation by curating and releasing arxivDIGESTables, a new dataset of 2,228 literature review tables extracted from ArXiv papers that synthesize a total of 7,542 research papers. Second, to support scalable evaluation of model generations against human-authored reference tables, we develop DecontextEval, an automatic evaluation method that aligns elements of tables with the same underlying aspects despite differing surface forms. Given these tools, we evaluate LMs' abilities to reconstruct reference tables, finding this task benefits from additional context to ground the generation (e.g. table captions, in-text references). Finally, through a human evaluation study we find that even when LMs fail to fully reconstruct a reference table, their generated novel aspects can still be useful.

Paper Structure

This paper contains 65 sections, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Schematic of our literature review table generation task: (1) synthesize multiple input papers into a table with both (2) a schema (columns) and (3) values. Each row corresponds to an input paper.
  • Figure 2: Side-by-side comparison of a reference literature review table from an ArXiv paper LiuAda2023 and a model-generated table given the same input papers. The generated table has reconstructed two gold aspects: the foopink!60 pink and foolightblue blue aspects are the same, despite surface form differences (e.g., "Task" vs "Intended Application"). The generated table has also proposed two novel aspects that are still relevant and useful, like "evaluation metric" ( foolightgreen green) or "Annotation method" ( fooyellow yellow) not to be confused with reference table's "Annotations".
  • Figure 3: Pipeline for curating arxivDIGESTables involves extensive data cleaning and filtering. The full pipeline filters from 2.5 million starting tables published in 800,000 papers to 2,228 tables published in 1,723 papers. Data pipeline described in §\ref{['sec:data-curation']}.
  • Figure 4: Recall averaged over different contexts and systems. The band represents 95% confidence interval. Llama3 scorers have high recall, but low precision. Sentence Transformers (decontext) has the best trade-off.
  • Figure 5: Schema recall for GPT-3.5-Turbo and Mixtral 8x22, using various types of additional contexts. All scores are computed using our best metric: sentence transformer-based scorer with decontext featurizer. More context improves recall, but does not lead to completely reproducing reference table schemas.
  • ...and 3 more figures