
Automated Text Mining of Experimental Methodologies from Biomedical Literature

Ziqing Guo

TL;DR

The paper addresses automated extraction of experimental methodologies from biomedical literature by fine-tuning DistilBERT for multi-label methodology classification. It leverages ontology-driven labels from the NCBO-EDAM framework and builds a data pipeline from Entrez/BioC data sources, using ~32,000 biomedical abstracts and full texts for training. The approach is compared against non-fine-tuned baselines, demonstrating improved accuracy and F1 scores, particularly when full-text content is included. The study offers a scalable, ontology-informed pathway for metadata extraction in biomedicine with potential applications in literature curation and research workflows.
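As a rough illustration of how multi-label predictions can be scored with F1, here is a minimal sketch in plain Python. It assumes one independent sigmoid per methodology label, a 0.5 decision threshold, and micro-averaged F1; none of these choices is stated in this summary, and the logits and gold labels are made-up toy values, not data from the paper.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.5):
    """Turn one row of per-label logits into a 0/1 vector.

    Each label is decided independently, which is what makes the task
    multi-label rather than multi-class. The threshold is an assumption.
    """
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all documents and labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical toy data: 2 documents, 3 methodology labels.
logits = [[2.0, -1.0, 0.5], [-3.0, 1.5, -0.2]]
y_true = [[1, 0, 0], [0, 1, 1]]

y_pred = [predict_labels(row) for row in logits]
score = micro_f1(y_true, y_pred)
print(y_pred, round(score, 3))
```

Micro-averaging pools counts across all labels, so frequent labels dominate the score; macro-averaging (a per-label F1, then the mean) would weight rare methodologies equally.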

Abstract

Biomedical literature is a rapidly expanding field of science and technology. Classification of biomedical texts is an essential part of biomedical research, especially in the field of biology. This work proposes a fine-tuned DistilBERT, a methodology-specific, pre-trained generative classification language model for mining biomedical texts. DistilBERT has proven effective at language understanding while being 40% smaller than BERT and 60% faster. The main objective of this project is to improve the model and assess its performance against the non-fine-tuned model. We used DistilBERT as a support model and trained it on a corpus of 32,000 abstracts and full-text articles; our results surpassed those of traditional literature classification methods based on RNNs or LSTMs. Our aim is to integrate this highly specialised model into different research industries.

Paper Structure (18 sections, 8 equations, 13 figures, 4 tables)


Figures (13)

  • Figure 1: Each model is made up of a transformer, a tokenizer, and a head; the mechanics of generating output resemble a funnel.
  • Figure 2: (Left) Example of a model page and model card for the DistilBERT (Sanh et al., 2019) base model, a smaller and faster transformer model than BERT that uses BERT as its backbone and the same self-supervision. (Right) A graph of inference-time statistics for the pre-trained BERT model is shown above, with model size and tensor type below. A data science engineer can use this model in a pipeline for masked language modelling.
  • Figure 3: The value vectors correspond to the q, k, and i indices and are multiplied by the distribution, assigning greater importance to the more significant vectors. Figure adapted from The Illustrated Transformer (Alammar, 2018).
  • Figure 4: Multi-head attention involves multiple attention layers that operate in parallel. Figure adapted from The Illustrated Transformer (Alammar, 2018).
  • Figure 5: (a, top) The three sub-categories of experimental design. (b, bottom) The 12 techniques of the topic. More details: https://bioportal.bioontology.org/ontologies/EDAM/
  • ...and 8 more figures
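
The weighting described in the Figure 3 caption, where value vectors are multiplied by a distribution so that more relevant vectors count for more, can be sketched as standard scaled dot-product attention. This is a minimal plain-Python illustration, not the paper's code; the function names and toy vectors are assumptions chosen for the example.

```python
import math

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    For each query, score every key by a dot product scaled by sqrt(d_k),
    turn the scores into a distribution with softmax, and return the
    distribution-weighted sum of the value vectors.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs

# Hypothetical toy input: both keys are identical, so the attention
# weights are uniform and the output is the mean of the value vectors.
queries = [[1.0, 0.0]]
keys = [[1.0, 1.0], [1.0, 1.0]]
values = [[0.0, 2.0], [4.0, 6.0]]
result = attention(queries, keys, values)
print(result)
```

Multi-head attention (Figure 4) runs several such computations in parallel on different learned projections of the same input and concatenates the results.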