Biomedical Relation Extraction via Adaptive Document-Relation Cross-Mapping and Concept Unique Identifier

Yufei Shang, Yanrong Guo, Shijie Hao, Richang Hong

TL;DR

This work tackles document-level Bio-RE by addressing three problems: cross-sentence inference, data scarcity, and external knowledge integration. It introduces a two-stage framework: ADRCM fine-tuning, which learns adaptive document–relation mappings, and CUI RAG, which retrieves biomedical context at inference time. Training data is augmented with synthetic examples generated by the Iteration-of-REsummary (IoRs) prompt, whose iterations terminate at a threshold $\beta$. The approach fine-tunes LLaMA2-7B-Chat with LoRA on ADRCM-structured data and uses CUI RAG with hierarchical CUIs to retrieve relevant snippets, achieving state-of-the-art results on CDR, GDA, and BioRED. The combination improves cross-sentence reasoning, reduces ambiguity from entity aliases during retrieval, and demonstrates the value of synthetic data and external knowledge for domain-specific relation extraction in biomedical texts, with practical implications for knowledge base construction and biomedical QA.
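The iterative generation with a stopping threshold can be pictured with a small loop. This is only a minimal sketch under stated assumptions, not the authors' procedure: the summary does not say what quantity is compared against $\beta$, so the sketch assumes a quality score that should reach at least `beta`. The names `iterate_resummary`, `toy_generate`, and `toy_score` are hypothetical, and the toy functions stand in for the ChatGPT call and for whatever measure the paper actually uses.

```python
def toy_generate(text, entities):
    # Hypothetical stand-in for an LLM call: drop the first sentence
    # that mentions none of the target entities.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    for i, sentence in enumerate(sentences):
        if not any(e.lower() in sentence.lower() for e in entities):
            del sentences[i]
            break
    return ". ".join(sentences) + "."


def toy_score(text, entities):
    # Hypothetical quality measure: fraction of sentences that mention
    # at least one target entity.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not sentences:
        return 0.0
    hits = sum(any(e.lower() in s.lower() for e in entities) for s in sentences)
    return hits / len(sentences)


def iterate_resummary(document, entities, generate, score, beta, max_rounds=5):
    # Assumption: each round is an independent call that sees only the
    # current text, and iteration stops once the score reaches beta.
    current = document
    current_score = score(current, entities)
    rounds = 0
    while current_score < beta and rounds < max_rounds:
        current = generate(current, entities)
        current_score = score(current, entities)
        rounds += 1
    return current, current_score, rounds


doc = (
    "ADH1B is a gene. The weather was mild. "
    "ADH1B variants were studied in patients with alcohol dependence. "
    "The study ran for two years."
)
text, final_score, rounds = iterate_resummary(
    doc, ["ADH1B", "alcohol dependence"], toy_generate, toy_score, beta=0.9
)
print(rounds, final_score)
print(text)
```

The `max_rounds` cap is an added safeguard so the loop ends even if the threshold is never reached.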

Abstract

Document-Level Biomedical Relation Extraction (Bio-RE) aims to identify relations between biomedical entities within extensive texts, serving as a crucial subfield of biomedical text mining. Existing Bio-RE methods struggle with cross-sentence inference, which is essential for capturing relations spanning multiple sentences. Moreover, previous methods often overlook the incompleteness of documents and lack the integration of external knowledge, limiting contextual richness. In addition, the scarcity of annotated data hampers model training. Recent advancements in large language models (LLMs) have inspired us to address all of the above issues for document-level Bio-RE. Specifically, we propose a document-level Bio-RE framework via LLM Adaptive Document-Relation Cross-Mapping (ADRCM) Fine-Tuning and Concept Unique Identifier (CUI) Retrieval-Augmented Generation (RAG). First, we introduce the Iteration-of-REsummary (IoRs) prompt to address the data scarcity issue. With this prompt, Bio-RE task-specific synthetic data can be generated by guiding ChatGPT to focus on entity relations and iteratively refining the synthetic data. Next, we propose ADRCM fine-tuning, a novel fine-tuning recipe that establishes mappings across different documents and relations, enhancing the model's contextual understanding and cross-sentence inference capabilities. Finally, during inference, a biomedical-specific RAG approach, named CUI RAG, is designed to leverage CUIs as indexes for entities, narrowing the retrieval scope and enriching the relevant document contexts. Experiments conducted on three Bio-RE datasets (GDA, CDR, and BioRED) demonstrate the state-of-the-art performance of our proposed method compared with related work.
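To illustrate why indexing by CUI narrows the retrieval scope, the sketch below maps surface mentions to a shared identifier and looks up snippets by that identifier instead of by raw string. It is a minimal illustration under assumptions, not the paper's CUI RAG implementation: the `CuiIndex` class, the alias table, the identifiers (made-up placeholders, not real UMLS CUIs), and the snippet text are all hypothetical.

```python
from collections import defaultdict

# Hypothetical alias table: surface form -> concept identifier.
# The identifier is a placeholder, not a real UMLS CUI.
ALIAS_TO_CUI = {
    "ADH1B": "CUI_GENE_001",
    "alcohol dehydrogenase 1B": "CUI_GENE_001",
}


class CuiIndex:
    """Toy snippet store keyed by concept identifier rather than by mention text."""

    def __init__(self, alias_to_cui):
        # Normalise aliases so lookups are case-insensitive.
        self.alias_to_cui = {k.lower(): v for k, v in alias_to_cui.items()}
        self.snippets = defaultdict(list)

    def add(self, cui, snippet):
        self.snippets[cui].append(snippet)

    def retrieve(self, mention, limit=3):
        # Resolve the mention to its identifier first; unknown mentions
        # return nothing instead of falling back to fuzzy text search.
        cui = self.alias_to_cui.get(mention.strip().lower())
        if cui is None:
            return []
        return self.snippets.get(cui, [])[:limit]


index = CuiIndex(ALIAS_TO_CUI)
index.add("CUI_GENE_001", "Placeholder snippet describing the ADH1B gene.")

# Two different surface forms resolve to the same snippets.
print(index.retrieve("ADH1B"))
print(index.retrieve("Alcohol dehydrogenase 1B"))
print(index.retrieve("unlisted entity"))
```

Because every alias resolves to one key, the lookup only touches snippets filed under that concept. The hierarchical CUIs mentioned in the TL;DR are not modelled here.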

Paper Structure

This paper contains 16 sections, 7 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: A document-level Bio-RE example from the GDA dataset [10.1007/GDA]. Mentions of the same entity are highlighted in the same color. Solid underlines indicate disease entities, while dashed underlines indicate gene entities. The lower right corner shows the retrieval results for the ADH1B gene from Wikipedia and the National Center for Biotechnology Information (NCBI) Gene database.
  • Figure 2: The performance of LLMs on the test sets of the CDR, GDA, and BioRED datasets.
  • Figure 3: Overview of our framework. Gray, red, and blue are used to distinguish different entities, relations, and documents.
  • Figure 4: An example of IoRs. The generation process is independent, meaning that each step does not retain the memory of the previous steps.
  • Figure 6: Examples of a vanilla prompt and a chain-of-thought prompt. Our proposed IoRs prompt is illustrated in Figure 4. Using these three types of prompts, we generated three distinct sets of synthetic data with ChatGPT.