Integrating curation into scientific publishing to train AI models

Jorge Abreu-Vicente, Hannah Sonntag, Thomas Eidens, Cassie S. Mitchell, Thomas Lemberger

TL;DR

The SourceData-NLP dataset, built by embedding curation into the academic publishing process to annotate segmented figure panels and captions, is evaluated for training AI models on three tasks: named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task that assesses whether an entity is a controlled intervention target or a measurement object.

Abstract

High-throughput extraction and structured labeling of data from academic articles is critical to enable downstream machine learning applications and secondary analyses. We have embedded multimodal data curation into the academic publishing process to annotate segmented figure panels and captions. Natural language processing (NLP) was combined with human-in-the-loop feedback from the original authors to increase annotation accuracy. Annotation included eight classes of bioentities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases) plus additional classes delineating the entities' roles in experimental designs and methodologies. The resulting dataset, SourceData-NLP, contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 articles in molecular and cell biology. We evaluate the utility of the dataset for training AI models on named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task assessing whether an entity is a controlled intervention target or a measurement object. We also illustrate the use of our dataset in a multimodal task that segments figures into panel images and their corresponding captions.
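To make the annotation scheme more concrete, the following is a minimal sketch of how one caption annotated with entity classes and experimental roles could be turned into token-level tags for the named-entity recognition task and the role-classification task. It rests on several assumptions: the label strings, the `Entity` class, the `to_iob2` helper, and the example caption are all hypothetical and are not taken from the SourceData-NLP release. The IOB2 tagging scheme is a common convention for NER, not necessarily the one the authors use.

```python
from dataclasses import dataclass
from typing import List, Optional

# The eight entity classes named in the abstract. The short label strings
# are hypothetical, chosen only for this illustration.
ENTITY_TYPES = {
    "SMALL_MOLECULE", "GENE_PRODUCT", "SUBCELLULAR", "CELL_LINE",
    "CELL_TYPE", "TISSUE", "ORGANISM", "DISEASE",
}
# Hypothetical labels for the two experimental roles described in the abstract.
ROLES = {"CONTROLLED", "MEASURED"}


@dataclass(frozen=True)
class Entity:
    start: int                  # index of the first token of the entity
    end: int                    # index one past the last token
    etype: str                  # one of ENTITY_TYPES
    role: Optional[str] = None  # one of ROLES, or None if no role applies


def to_iob2(n_tokens: int, entities: List[Entity], field: str) -> List[str]:
    """Build IOB2 tags from entity spans, using `etype` or `role` as the label."""
    tags = ["O"] * n_tokens
    for ent in entities:
        label = getattr(ent, field)
        if label is None:
            continue  # the entity carries no label for this task
        tags[ent.start] = f"B-{label}"
        for i in range(ent.start + 1, ent.end):
            tags[i] = f"I-{label}"
    return tags


# Invented caption, split on whitespace for simplicity.
caption = "Western blot of AGO2 in HeLa cells treated with rapamycin"
tokens = caption.split()

entities = [
    Entity(3, 4, "GENE_PRODUCT", "MEASURED"),       # AGO2 is what is measured
    Entity(5, 6, "CELL_LINE"),                      # HeLa has no role here
    Entity(9, 10, "SMALL_MOLECULE", "CONTROLLED"),  # rapamycin is the intervention
]

ner_tags = to_iob2(len(tokens), entities, "etype")
role_tags = to_iob2(len(tokens), entities, "role")

for token, ner, role in zip(tokens, ner_tags, role_tags):
    print(f"{token:10s} {ner:18s} {role}")
```

Keeping the entity class and the role as separate tag sequences reflects the distinction drawn in the abstract: the class of an entity is largely a property of the term itself, whereas its role as intervention target or measurement object depends on the experimental context described in the caption.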

Paper Structure (47 sections, 10 figures, 6 tables)

Figures (10)

  • Figure 1: SourceData-NLP process for embedding article curation as part of the academic publishing process.
  • Figure 2: Memorization vs. generalization performance of PubMedBERT and BioLinkBERT models. The bar chart compares the F1 scores, distinguishing between overall performance, and specific performance in memorization and generalization tasks. Bars with diagonal stripes indicate the large versions of the models, whereas solid bars represent the base versions. PubMedBERT is denoted by blue bars, and BioLinkBERT by orange bars.
  • Figure 2: Schemes like the one shown above do not need to be annotated.
  • Figure 3: Key elements missing in the figure caption should be added as a floating tag: in this example, the measured variable component AGO2 is missing and was added as a floating tag.
  • Figure 4: In the example above, tubulin is the normalizing component.
  • ...and 5 more figures