Enhancing Biomedical Knowledge Discovery for Diseases: An Open-Source Framework Applied on Rett Syndrome and Alzheimer's Disease
Christos Theodoropoulos, Andrei Catalin Coman, James Henderson, Marie-Francine Moens
TL;DR
This work addresses the challenge of extracting structured disease knowledge from the rapidly expanding biomedical literature by introducing an open-source end-to-end framework that builds disease-centric knowledge from raw text. It develops two annotated datasets for Rett syndrome and Alzheimer's disease (ReDReS and ReDAD), complemented by distantly supervised variants (DiSReDReS and DiSReDAD), and investigates multiple transformer-based architectures (LaMEL and LaMReD) with diverse entity and relation representations. The study demonstrates strong performance of encoder-based models, highlights effective representations (e.g., R_G for binary and R_L/R_J/R_O for multi-class classification), and shows the value of probing for understanding how transformers capture semantic relations, including their cross-disease transfer capabilities. By providing open datasets, robust baselines, and insights into model behavior, the work advances scalable knowledge discovery in biomedicine and lays the groundwork for evaluating additional large language models on disease-focused relation extraction tasks.
Abstract
The ever-growing volume of biomedical publications creates a critical need for efficient knowledge discovery. In this context, we introduce an open-source end-to-end framework designed to construct knowledge around specific diseases directly from raw text. To facilitate research in disease-related knowledge discovery, we create two annotated datasets focused on Rett syndrome and Alzheimer's disease, enabling the identification of semantic relations between biomedical entities. Extensive benchmarking explores various relation and entity representations, offering insights into optimal modeling strategies for semantic relation detection and highlighting language models' competence in knowledge discovery. We also conduct probing experiments using different layer representations and attention scores to explore transformers' ability to capture semantic relations.
