Extracting Information in a Low-resource Setting: Case Study on Bioinformatics Workflows

Clémence Sebe, Sarah Cohen-Boulakia, Olivier Ferret, Aurélie Névéol

TL;DR

The paper addresses the extraction of detailed information about bioinformatics workflows from scientific articles in a low-resource setting, where annotated data are scarce. It introduces BioToFlow, a corpus of 52 articles annotated with 16 entity types, and tests four strategies: building this tailored corpus, decoder-based few-shot NER, encoder-based NER trained on SoftCite, on BioToFlow, and on their fusion, and knowledge integration via pre-initialized encoders and tool vocabulary augmentation. A SciBERT-based NER model trained on BioToFlow reaches an F-measure of 70.4, comparable to inter-annotator agreement, while decoder-based methods underperform. Knowledge integration improves results for specific entities such as tools but is less effective across the full information schema. The work shows that high-performance information extraction for bioinformatics workflows is feasible in low-resource contexts, and it provides a resource and a methodological roadmap for future research on cross-source integration, few-shot learning, and linking workflows described in the literature to code in repositories.

Abstract

Bioinformatics workflows are essential for complex biological data analyses and are often described in scientific articles with source code in public repositories. Extracting detailed workflow information from articles can improve accessibility and reusability but is hindered by limited annotated corpora. To address this, we framed the problem as a low-resource extraction task and tested four strategies: 1) creating a tailored annotated corpus, 2) few-shot named-entity recognition (NER) with an autoregressive language model, 3) NER using masked language models with existing and new corpora, and 4) integrating workflow knowledge into NER models. Using BioToFlow, a new corpus of 52 articles annotated with 16 entities, a SciBERT-based NER model achieved a 70.4 F-measure, comparable to inter-annotator agreement. While knowledge integration improved performance for specific entities, it was less effective across the entire information schema. Our results demonstrate that high-performance information extraction for bioinformatics workflows is achievable.
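
For readers unfamiliar with the metric, the F-measure reported above can be illustrated with a minimal sketch of entity-level scoring. This is not the authors' evaluation code: it assumes exact-match scoring (a predicted entity counts only if its offsets and label both match a gold entity), and the spans and the labels "Tool", "Data", and "Method" are hypothetical examples, not taken from BioToFlow.

```python
from typing import Set, Tuple

# An entity mention: (start offset, end offset, entity label).
Span = Tuple[int, int, str]


def f_measure(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match entity scoring.

    Assumption: a prediction is correct only if offsets AND label match
    a gold entity exactly. The paper may use a different criterion.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotations and model predictions for one sentence.
gold = {(0, 6, "Tool"), (10, 18, "Data"), (25, 31, "Tool")}
pred = {(0, 6, "Tool"), (10, 18, "Method"), (25, 31, "Tool"), (40, 45, "Tool")}

p, r, f = f_measure(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

In this toy case two of the four predictions are correct and two of the three gold entities are found, so the wrongly labeled span and the spurious span lower precision while the missed "Data" entity lowers recall. Comparing two annotators' span sets with the same function is one common way to express inter-annotator agreement on the same scale as model performance.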

Paper Structure

This paper contains 18 sections, 2 figures, and 10 tables.

Figures (2)

  • Figure 1: Bioinformatics workflow representation schema.
  • Figure 2: Excerpt from the annotated corpus using the BRAT software.