Towards Efficient Methods in Medical Question Answering using Knowledge Graph Embeddings

Saptarshi Sengupta, Connor Heaton, Suhan Cui, Soumalya Sarkar, Prasenjit Mitra

TL;DR

This work tackles the high cost of domain-specific pre-training for medical QA by injecting domain knowledge through Knowledge Graph Embeddings (KGE) aligned to LM spaces via a simple MLP-based homogenization. By fusing homogenized KG signals and definition embeddings into open-domain models like BERT and RoBERTa, the approach achieves competitive performance on COVID-QA and PubMedQA without relying on extensive in-domain pre-training. Results show that non-domain models can closely match or surpass domain-specific baselines on these medical QA tasks, with performance gains varying by dataset and model, and ablations highlighting the value of external knowledge, especially when vocabulary overlap is limited. The method offers a scalable alternative to pre-training, with practical implications for efficient medical QA systems and broader knowledge integration in NLP.

Abstract

In Natural Language Processing (NLP), Machine Reading Comprehension (MRC) is the task of answering a question based on a given context. To handle questions in the medical domain, modern language models such as BioBERT, SciBERT and even ChatGPT are trained on vast amounts of in-domain medical corpora. However, in-domain pre-training is expensive in terms of time and resources. In this paper, we propose a resource-efficient approach for injecting domain knowledge into a model without relying on such domain-specific pre-training. Knowledge graphs are powerful resources for accessing medical information. Building on existing work, we introduce a method that uses Multi-Layer Perceptrons (MLPs) to align embeddings extracted from medical knowledge graphs with the embedding spaces of pre-trained language models (LMs) and to integrate them into those models. The aligned embeddings are fused with the open-domain LMs BERT and RoBERTa, which are fine-tuned for two MRC tasks: span detection (COVID-QA) and multiple-choice questions (PubMedQA). We compare our method to prior techniques that rely on a vocabulary overlap for embedding alignment and show how our method circumvents this requirement to deliver better performance. On both datasets, our method allows BERT/RoBERTa either to perform on par with (and occasionally exceed) stronger domain-specific models or to show general improvements over prior techniques. With the proposed approach, we offer an alternative to in-domain pre-training for achieving domain proficiency. Our code is available here.
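
To make the alignment idea in the abstract concrete, the following is a minimal sketch of how an MLP could map a knowledge graph embedding into a language model's embedding space before fusion. It is not the authors' implementation. The two-layer shape, the ReLU activation, the toy dimensions, the random untrained weights, and fusion by element-wise addition are all assumptions made for illustration, and the helper names (`make_layer`, `homogenize`, `fuse`) are hypothetical.

```python
import random


def make_layer(n_in, n_out, rng):
    """Create a randomly initialised dense layer as (weights, biases)."""
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, x):
    """Apply an affine map: one dot product plus bias per output unit."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    """Element-wise ReLU (assumed activation)."""
    return [max(0.0, v) for v in x]


def homogenize(kg_vec, hidden_layer, output_layer):
    """Project a KG embedding into the LM space with a two-layer MLP."""
    return dense(output_layer, relu(dense(hidden_layer, kg_vec)))


def fuse(token_vec, aligned_vec):
    """Hypothetical fusion step: element-wise addition of the two vectors."""
    return [t + a for t, a in zip(token_vec, aligned_vec)]


rng = random.Random(0)

# Toy sizes (assumptions); real KG and LM embeddings are far larger.
KG_DIM, HIDDEN_DIM, LM_DIM = 4, 8, 6

hidden_layer = make_layer(KG_DIM, HIDDEN_DIM, rng)
output_layer = make_layer(HIDDEN_DIM, LM_DIM, rng)

# Stand-ins for an entity's KG embedding and an LM token embedding.
kg_vec = [rng.uniform(-1, 1) for _ in range(KG_DIM)]
token_vec = [rng.uniform(-1, 1) for _ in range(LM_DIM)]

aligned = homogenize(kg_vec, hidden_layer, output_layer)
fused = fuse(token_vec, aligned)

print(len(aligned), len(fused))
```

In practice the MLP weights would be learned so that projected entity vectors land near suitable targets in the LM space. Because the projection is a learned function of the KG embedding alone, it can be applied to entities that have no counterpart in the LM vocabulary, which is the property the abstract contrasts with overlap-based alignment.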

Paper Structure

This paper contains 19 sections, 3 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Proposed Homogenization Method explained using an example entity hiv-1 infection. Here, E stands for the models' Vocabulary Embedding.