ProtoBERT-LoRA: Parameter-Efficient Prototypical Finetuning for Immunotherapy Study Identification

Shijia Zhang, Xiyu Ding, Kai Ding, Jacob Zhang, Kevin Galinsky, Mengrui Wang, Ryan P. Mayers, Zheyu Wang, Hadi Kharrazi

TL;DR

Identifying immune checkpoint inhibitor (ICI) studies in GEO is difficult because of semantic ambiguity, extreme class imbalance, and limited labeled data. ProtoBERT-LoRA integrates PubMedBERT with prototypical networks and Low-Rank Adaptation (LoRA) adapters to enforce class-separable embeddings via episodic prototypes while preserving biomedical knowledge. On a test set of 71 positives and 765 negatives, the approach achieves an F1-score of 0.624 (precision 0.481, recall 0.887), outperforming a rule-based system, traditional ML baselines, and direct PubMedBERT fine-tuning; applying the model to 44,287 unlabeled studies reduces manual review by 82%. An ablation shows that combining prototypes with LoRA yields roughly a 29% improvement over stand-alone LoRA, highlighting the synergy between non-parametric prototype learning and parameter-efficient fine-tuning for low-resource biomedical NLP. The method offers a scalable tool for precise curation of rare immunotherapy studies in large genomic repositories, with potential applicability to other low-resource biomedical text mining tasks.

Abstract

Identifying immune checkpoint inhibitor (ICI) studies in genomic repositories such as the Gene Expression Omnibus (GEO) is vital for cancer research, yet it remains challenging due to semantic ambiguity, extreme class imbalance, and limited labeled data in low-resource settings. We present ProtoBERT-LoRA, a hybrid framework that combines PubMedBERT with prototypical networks and Low-Rank Adaptation (LoRA) for efficient fine-tuning. The model enforces class-separable embeddings via episodic prototype training while preserving biomedical domain knowledge. The dataset was split into four parts: Training (20 positive, 20 negative), Prototype Set (10 positive, 10 negative), Validation (20 positive, 200 negative), and Test (71 positive, 765 negative). On the test set, ProtoBERT-LoRA achieved an F1-score of 0.624 (precision: 0.481, recall: 0.887), outperforming the rule-based system, machine learning baselines, and fine-tuned PubMedBERT. Application to 44,287 unlabeled studies reduced manual review effort by 82%. Ablation studies confirmed that combining prototypes with LoRA improved performance by 29% over stand-alone LoRA.
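As a quick consistency check on the reported numbers, the F1-score can be recomputed from the stated precision and recall. The sketch below also back-solves a confusion matrix for the 71 test positives. The counts used (63 true positives, 68 false positives, 8 false negatives) are not reported in the abstract; they are an inference from the rounded metrics and should be read as an assumption.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1_from_pr(precision, recall)


# Values stated in the abstract.
reported_f1 = f1_from_pr(0.481, 0.887)

# ASSUMPTION: counts inferred from the rounded metrics, not taken from the paper.
# 63 of the 71 test positives recovered, 131 studies flagged in total.
tp, fn = 63, 71 - 63
fp = 131 - tp
precision, recall, f1 = metrics_from_counts(tp, fp, fn)

print(round(reported_f1, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

Under these inferred counts, about 8 ICI studies would be missed and about 68 of the 765 negatives would be flagged for review, which matches the high-recall, moderate-precision profile described above.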

Paper Structure

This paper contains 1 section, 5 equations, 2 figures, 3 tables.

Table of Contents

  1. Acknowledgments

Figures (2)

  • Figure 1: Overview of the prototypical fine-tuning framework in a two-class, five-shot training scenario for each query. $p_{pos}$ and $p_{neg}$ denote the prototypes for the ICI treatment and non-ICI classes, respectively. During training, only the LoRA matrices $A$ and $B$ are updated.
  • Figure 2: t-SNE visualizations of test set embeddings from different model configurations. (a) Vanilla pre-trained PubMedBERT without any task-specific adaptation. (b) PubMedBERT after conventional full fine-tuning. (c) ProtoBERT-LoRA, showing embeddings from the prototypical fine-tuning approach with LoRA adapters.
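
The mechanism described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the 2-D vectors stand in for PubMedBERT embeddings, the squared Euclidean distance and softmax follow the standard prototypical-network formulation (the paper's exact distance is not given here), and all function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (the class prototype)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(u, v):
    """Squared Euclidean distance (ASSUMPTION: the paper's metric may differ)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def prototype_probs(query, prototypes):
    """Softmax over negative squared distances to each class prototype."""
    logits = {label: -sq_dist(query, p) for label, p in prototypes.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(z - top) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def matvec(matrix, x):
    """Matrix-vector product for nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


def lora_linear(x, W, A, B, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B are trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


# Toy embeddings: five support examples per class (two-class, five-shot).
support_pos = [[1.0, 1.2], [0.8, 1.0], [1.2, 0.9], [1.1, 1.1], [0.9, 0.8]]
support_neg = [[-1.0, -0.9], [-1.2, -1.1], [-0.8, -1.0], [-0.9, -1.2], [-1.1, -0.8]]

prototypes = {"pos": mean_vector(support_pos), "neg": mean_vector(support_neg)}
probs = prototype_probs([0.7, 0.9], prototypes)
predicted = max(probs, key=probs.get)

# Rank-1 LoRA on a 2x2 frozen weight; B starts at zero, so the adapted
# layer initially reproduces the frozen layer exactly.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]   # shape (1, 2)
B = [[0.0], [0.0]]  # shape (2, 1)
y0 = lora_linear([2.0, 3.0], W, A, B)

print(predicted, y0)
```

In a real training episode the cross-entropy loss on `probs` would be backpropagated through the encoder, with gradients reaching only `A` and `B`; that is what lets the embedding space reorganize around the prototypes while the pre-trained weights stay fixed.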