BPQA Dataset: Evaluating How Well Language Models Leverage Blood Pressures to Answer Biomedical Questions

Chi Hang, Ruiqi Deng, Lavender Yao Jiang, Zihao Yang, Anton Alyakin, Daniel Alber, Eric Karl Oermann

TL;DR

The paper tackles whether language models can leverage clinical measurements, focusing on blood pressure, to answer biomedical questions. It introduces BPQA, a $100$-item BP-dependent QA dataset verified by medical students, and evaluates four LMs (BERT, BioBERT, MedAlpaca, GPT-3.5) in zero-shot QA across BP-related variants to isolate the impact of BP information. Key findings show larger LMs benefit more from BP data, label augmentation helps domain-specific models but can hurt generic ones, and context-specific labeling particularly aids GPT-3.5, with raw BP inputs yielding the best performance for GPT-3.5 (around $0.83$ accuracy). The work highlights the need for specialized benchmarks and suggests retrieval-augmented, context-aware approaches and measurement-oriented tokenizers to improve clinical LM reasoning with quantitative data.

Abstract

Clinical measurements such as blood pressures and respiration rates are critical in diagnosing and monitoring patient outcomes. They are an important component of biomedical data, which can be used to train transformer-based language models (LMs) for improving healthcare delivery. It is, however, unclear whether LMs can effectively interpret and use clinical measurements. We investigate two questions: First, can LMs effectively leverage clinical measurements to answer related medical questions? Second, how can an LM's performance on medical question-answering (QA) tasks that involve measurements be enhanced? We performed a case study on blood pressure readings (BPs), a vital sign routinely monitored by medical professionals. We evaluated the performance of four LMs (BERT, BioBERT, MedAlpaca, and GPT-3.5) on our newly developed dataset, BPQA (Blood Pressure Question Answering). BPQA contains $100$ medical QA pairs that were verified by medical students and designed to rely on BPs. We found that GPT-3.5 and MedAlpaca (large and medium-sized LMs) benefit more from the inclusion of BPs than BERT and BioBERT (small LMs). Further, augmenting measurements with labels improves the performance of BioBERT and MedAlpaca (domain-specific LMs), suggesting that retrieval may be useful for improving domain-specific LMs.
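To make "augmenting measurements with labels" concrete, the following is a minimal sketch of one way such augmentation could look. It is not the authors' implementation: the function names `bp_label` and `augment_with_labels` are hypothetical, and the label scheme (2017 ACC/AHA categories plus a common 90/60 cutoff for low blood pressure) is an assumption, since the abstract does not state which labels were used.

```python
import re

# Matches readings written as systolic/diastolic, e.g. "150/95".
# Assumption: this simple pattern would also match other "NN/NN" strings
# (such as some dates), so real clinical text needs stricter handling.
BP_PATTERN = re.compile(r"\b(\d{2,3})/(\d{2,3})\b")


def bp_label(systolic: int, diastolic: int) -> str:
    """Map a reading to a coarse category.

    Assumed thresholds (not taken from the paper): 2017 ACC/AHA
    categories, plus a commonly used 90/60 cutoff for low pressure.
    """
    if systolic >= 140 or diastolic >= 90:
        return "stage 2 hypertension"
    if systolic >= 130 or diastolic >= 80:
        return "stage 1 hypertension"
    if systolic >= 120:
        return "elevated"
    if systolic < 90 or diastolic < 60:
        return "low"
    return "normal"


def augment_with_labels(text: str) -> str:
    """Append a category label after every BP reading found in the text."""

    def add_label(match: re.Match) -> str:
        systolic, diastolic = int(match.group(1)), int(match.group(2))
        return f"{match.group(0)} ({bp_label(systolic, diastolic)})"

    return BP_PATTERN.sub(add_label, text)


if __name__ == "__main__":
    print(augment_with_labels("Patient presents with a BP of 150/95 mmHg."))
```

A question passed through `augment_with_labels` keeps the raw reading and gains a textual label, which is the kind of paired input a labeled variant of a BP question would contain under these assumptions.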

Paper Structure

This paper contains 17 sections, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Comparison of model performance on different BPQA variants.