Enhancing Biomedical Knowledge Discovery for Diseases: An Open-Source Framework Applied on Rett Syndrome and Alzheimer's Disease

Christos Theodoropoulos; Andrei Catalin Coman; James Henderson; Marie-Francine Moens

TL;DR

This work addresses the challenge of extracting structured disease knowledge from the rapidly expanding biomedical literature by introducing an open-source end-to-end framework that builds disease-centric knowledge from raw text. It develops two annotated datasets, ReDReS for Rett syndrome and ReDAD for Alzheimer's disease, and investigates multiple transformer-based architectures (LaMEL and LaMReD) with diverse entity and relation representations, complemented by distantly supervised variants of the datasets (DiSReDReS and DiSReDAD). The study demonstrates strong performance of encoder-based models, highlights effective representations (e.g., R_G for binary and R_L/R_J/R_O for multi-class classification), and uses probing to examine how transformers capture semantic relations, including cross-disease transfer capabilities. By providing open datasets, robust baselines, and insights into model behavior, the work advances scalable knowledge discovery in biomedicine and lays the groundwork for evaluating additional large language models on disease-focused relation extraction tasks.

Abstract

The ever-growing volume of biomedical publications creates a critical need for efficient knowledge discovery. In this context, we introduce an open-source end-to-end framework designed to construct knowledge around specific diseases directly from raw text. To facilitate research in disease-related knowledge discovery, we create two annotated datasets focused on Rett syndrome and Alzheimer's disease, enabling the identification of semantic relations between biomedical entities. Extensive benchmarking explores various ways to represent relations and entities, offering insights into optimal modeling strategies for semantic relation detection and highlighting language models' competence in knowledge discovery. We also conduct probing experiments using different layer representations and attention scores to explore transformers' ability to capture semantic relations.

Paper Structure (21 sections, 1 equation, 19 figures, 14 tables)

Introduction
Related Work
Data Pipeline
Models
LaMEL model
LaMReD model
Experimental setup
Experiments with Additional Language Models
Results
Error Analysis
Distantly Supervised Datasets
Probing
Conclusion
Data Pipeline: Additional Information
Annotation Portal
...and 6 more sections

Figures (19)

Figure 1: Publication Trends: RS and AD
Figure 2: The pipeline starts with abstract retrieval using a natural language query. Next, entities are detected and linked to UMLS, followed by the co-occurrence graph generation. The final step is the dataset creation using the processed text (abstract retrieval and mention extraction steps) and co-occurrence graph.
Figure 3: Visualization of a subgraph of the co-occurrence graph for the Rett Syndrome corpus. Each node corresponds to a unique CUI with the related textual description and contains the semantic type. The edge label includes the number of times two entities co-occur in a sentence and a list with the sentence IDs where the connected entities are detected.
Figure 4: Graphical example of the ReDAD and ReDReS datasets. Each node corresponds to an entity with a textual description and semantic type. The edge label includes the annotated relation type and the sentence ID where the connected entities are detected.
Figure 5: Model Architecture of LaMReDA, LaMReDM (left), and LaMEL (right): Each model encodes the input sequence using BiomedBERT (large or base). For LaMReDA and LaMReDM, different tokens define the relation representation (A-P), passed through a linear projection layer, a dropout layer, and then a classification layer for prediction. The symbol # denotes element-wise addition and multiplication for LaMReDA and LaMReDM, respectively. For LaMEL, different tokens construct the entity representation (A-H), which are sent through a dropout layer and a linear layer to extract the projected entity representations. The symbols ; and * define the concatenation and the element-wise multiplication, respectively, for the LaMEL model.
...and 14 more figures
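
The co-occurrence graph described in the captions of Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the framework's actual implementation: the function name build_cooccurrence_graph, the input layout (a mapping from sentence ID to the CUIs linked in that sentence), and the placeholder identifiers CUI_A, CUI_B, and CUI_C are all hypothetical. It also assumes that a pair of entities is counted once per sentence, even if either entity is mentioned several times in it.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences):
    """Build a sentence-level co-occurrence graph.

    sentences: dict mapping a sentence ID to the list of CUIs linked in
    that sentence (hypothetical input layout, assumed for this sketch).

    Returns a dict mapping an alphabetically sorted CUI pair to an edge
    label holding the co-occurrence count and the sentence IDs, mirroring
    the edge label described for Figure 3.
    """
    edges = defaultdict(lambda: {"count": 0, "sentence_ids": []})
    for sentence_id, cuis in sentences.items():
        # Assumption: repeated mentions of a CUI in one sentence count once.
        unique_cuis = sorted(set(cuis))
        for pair in combinations(unique_cuis, 2):
            edges[pair]["count"] += 1
            edges[pair]["sentence_ids"].append(sentence_id)
    return dict(edges)


# Placeholder identifiers, not real UMLS CUIs.
example_sentences = {
    "s1": ["CUI_A", "CUI_B"],
    "s2": ["CUI_B", "CUI_A", "CUI_C"],
    "s3": ["CUI_C"],
}
graph = build_cooccurrence_graph(example_sentences)
for pair, label in sorted(graph.items()):
    print(pair, label)
```

Keying each edge by a sorted pair keeps the graph undirected, so the order in which two entities appear in a sentence does not produce duplicate edges. Node attributes such as the textual description and semantic type could be stored in a separate mapping keyed by CUI.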
