Rapid Biomedical Research Classification: The Pandemic PACT Advanced Categorisation Engine

Omid Rohanian, Mohammadmahdi Nouriborji, Olena Seminog, Rodrigo Furst, Thomas Mendy, Shanthi Levanita, Zaharat Kadri-Alabi, Nusrat Jabin, Daniela Toale, Georgina Humphreys, Emilia Antonio, Adrian Bucher, Alice Norton, David A. Clifton

TL;DR

PPACE addresses the need to monitor biomedical research activity during health crises by providing a multilabel classifier that maps funded project abstracts to WHO-prioritized categories. It introduces an 8B LLaMA-based model fine-tuned on an augmented dataset whose rationales were generated by a 70B model, using LoRA for efficient adaptation and a rationale-first prompting approach. The authors publicly release the trained weights and an instruction-based dataset, and demonstrate that PPACE outperforms baselines on key multilabel metrics while highlighting dataset and category sparsity limitations. The work offers a practical tool for funders, policymakers, and researchers to track research trends, identify gaps, and align funding with global health priorities, with future directions toward smaller, more accessible models and optimized prompts.

Abstract

This paper introduces the Pandemic PACT Advanced Categorisation Engine (PPACE) along with its associated dataset. PPACE is a fine-tuned model developed to automatically classify research abstracts from funded biomedical projects according to WHO-aligned research priorities. This task is crucial for monitoring research trends and identifying gaps in global health preparedness and response. Our approach builds on human-annotated projects, which are allocated one or more categories from a predefined list. A large language model is then used to generate 'rationales' explaining the reasoning behind these annotations. This augmented data, comprising expert annotations and rationales, is subsequently used to fine-tune a smaller, more efficient model. Developed as part of the Pandemic PACT project, which aims to track and analyse research funding and clinical evidence for a wide range of diseases with outbreak potential, PPACE supports informed decision-making by research funders, policymakers, and independent researchers. We introduce and release both the trained model and the instruction-based dataset used for its training. Our evaluation shows that PPACE significantly outperforms its baselines. The release of PPACE and its associated dataset offers valuable resources for researchers in multilabel biomedical document classification and supports advancements in aligning biomedical research with key global health priorities.
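
The augmentation step described in the abstract pairs each expert annotation with a generated rationale before fine-tuning. The following is a minimal sketch of how one such instruction-style record could be assembled and later parsed. The field names (`instruction`, `input`, `output`), the prompt wording, the `Rationale:` / `Categories:` layout, and the placeholder labels are all assumptions made for illustration; they are not taken from the released PPACE dataset.

```python
def build_record(abstract, rationale, labels):
    """Assemble one instruction-style example with the rationale before the labels.

    The record layout is hypothetical, not the format of the released dataset.
    """
    instruction = (
        "Classify the following research abstract into one or more research "
        "priority categories. Explain your reasoning first, then list the categories."
    )
    # Rationale-first: the explanation precedes the final category list.
    response = f"Rationale: {rationale}\nCategories: {'; '.join(sorted(labels))}"
    return {"instruction": instruction, "input": abstract, "output": response}


def parse_categories(output):
    """Recover the label set from the last 'Categories:' line of a response."""
    for line in reversed(output.splitlines()):
        if line.startswith("Categories:"):
            items = line[len("Categories:"):].split(";")
            return {item.strip() for item in items if item.strip()}
    return set()  # no category line found


# Placeholder labels; the real category list is defined by the annotation scheme.
record = build_record(
    abstract="A funded project abstract would go here.",
    rationale="The project studies X, which falls under the listed categories.",
    labels=["Category B", "Category A"],
)
predicted = parse_categories(record["output"])
```

Placing the rationale ahead of the category list means a model trained on such records generates its reasoning before committing to labels, which is the idea the TL;DR refers to as rationale-first prompting.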

Paper Structure

This paper contains 18 sections, 1 equation, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: Individual Label Distribution in the Training Set.
  • Figure 2: Top 12 Combined Label Distribution in the Training Set.
  • Figure 3: Correlation Heatmap of Research Categories in the Training Set.
  • Figure 4: F1 Score Comparison by Category between the baseline Llama3 8B and the finetuned PPACE model. The categories are sorted from least to most frequent as seen in the test set.
  • Figure 5: Individual Label Distribution in the Test Set.
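
A per-category F1 comparison of the kind Figure 4 describes can be computed directly from gold and predicted label sets. The sketch below uses the standard F1 definition and orders categories from least to most frequent in the gold labels, as the caption states. The function name and the toy labels are hypothetical, and this is not the authors' evaluation code.

```python
def per_category_f1(gold, pred, categories):
    """Return (category, F1) pairs sorted from least to most frequent in gold.

    gold and pred are equal-length lists of label sets, one set per document.
    """
    results = []
    for cat in categories:
        tp = sum(1 for g, p in zip(gold, pred) if cat in g and cat in p)
        fp = sum(1 for g, p in zip(gold, pred) if cat not in g and cat in p)
        fn = sum(1 for g, p in zip(gold, pred) if cat in g and cat not in p)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # no support and no predictions
        support = tp + fn  # number of gold documents carrying this category
        results.append((support, cat, f1))
    results.sort(key=lambda row: row[0])  # least frequent first
    return [(cat, f1) for _, cat, f1 in results]


# Toy data with placeholder labels.
gold = [{"A", "B"}, {"A"}, {"B"}]
pred = [{"A"}, {"A", "B"}, {"B"}]
scores = per_category_f1(gold, pred, ["A", "B", "C"])
```

Sorting by support makes it easier to see whether the gains from fine-tuning are concentrated in frequent categories or extend to sparse ones.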