Enhancing Document Retrieval for Curating N-ary Relations in Knowledge Bases

Xing David Wang, Ulf Leser

TL;DR

This work tackles the challenge of retrieving documents to complete $n$-ary relations for biomedical knowledge-base curation. It introduces EDEL, a dense bi-encoder retriever that leverages weak supervision from KB-linked publications and employs a layered margin loss along with KB-aware hard negative sampling to handle noisy signals. Two new benchmarks, Precision Oncology (PO) and Post-Translational Modifications (PTM), demonstrate state-of-the-art performance, with notable gains in NDCG@10 and EntityRecall over zero-shot and fine-tuned baselines. The approach promises more efficient and reliable evidence retrieval to support complex biomedical curation tasks and can be extended to broader domains requiring structured relation completion.

Abstract

Curation of biomedical knowledge bases (KBs) relies on extracting accurate multi-entity relational facts from the literature, a process that remains largely manual and expert-driven. An essential step in this workflow is retrieving documents that can support or complete partially observed $n$-ary relations. We present a neural retrieval model designed to assist KB curation by identifying documents that help fill in missing relation arguments and provide relevant contextual evidence. To reduce dependence on scarce gold-standard training data, we exploit existing KB records to construct weakly supervised training sets. Our approach introduces two key technical contributions: (i) a layered contrastive loss that enables learning from noisy and incomplete relational structures, and (ii) a balanced sampling strategy that generates high-quality negatives from diverse KB records. On two biomedical retrieval benchmarks, our approach achieves state-of-the-art performance, outperforming strong baselines in NDCG@10 by 5.7 and 3.7 percentage points, respectively.
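
For readers unfamiliar with the headline metric, the following is a minimal sketch of how NDCG@10 is commonly computed for a single query. It uses the standard linear-gain formulation; the function names and the `all_relevances` parameter are illustrative assumptions, and the paper's exact evaluation setup (gain function, relevance grading) is not specified in this summary.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances=None, k=10):
    """NDCG@k for one query.

    ranked_relevances: relevance labels in the order the retriever returned them.
    all_relevances: labels of every judged document for the query (assumed
        parameter); if omitted, the ideal ranking is built from the ranked list.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents: define the score as 0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# A relevant document at rank 1 is ideal; at rank 2 the gain is discounted.
perfect = ndcg_at_k([1, 0, 0])
second = ndcg_at_k([0, 1, 0])
```

A gain of 5.7 percentage points in this metric corresponds to a difference of 0.057 in the averaged per-query score.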

Paper Structure

This paper contains 22 sections, 2 equations, 1 figure, 8 tables.

Figures (1)

  • Figure 1: Overview of the EDEL framework. Left: Negative samples are grouped into predefined classes and assigned margin values $\mu$ based on their overlap with query/answer entities. Right: Query and document embeddings are optimized using a MultiMargin contrastive loss.
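
The idea in the Figure 1 caption, where each negative class carries its own margin $\mu$, can be sketched as a per-class hinge loss. This is only an illustration under stated assumptions and not the authors' implementation: the dot-product scoring, the averaging over negatives, the class names, and the margin values below are all hypothetical, and the paper's own equations define the actual loss.

```python
def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def layered_margin_loss(query, positive, negatives, margins):
    """Hinge loss where each negative's margin depends on its class.

    query, positive: embedding vectors (lists of floats).
    negatives: list of (embedding, class_name) pairs.
    margins: dict mapping class_name -> margin mu (hypothetical values).
    """
    pos_score = dot(query, positive)
    losses = []
    for neg_vec, neg_class in negatives:
        mu = margins[neg_class]
        # Penalize the negative if it scores within mu of the positive.
        losses.append(max(0.0, mu - pos_score + dot(query, neg_vec)))
    return sum(losses) / len(losses) if losses else 0.0

# Hypothetical classes: one negative sharing entities with the query, one random.
margins = {"shares_query_entities": 0.2, "random": 0.5}
query = [1.0, 0.0]
positive = [1.0, 0.0]
negatives = [([0.0, 1.0], "random"), ([1.0, 0.0], "shares_query_entities")]
loss = layered_margin_loss(query, positive, negatives, margins)
```

In this toy case the random negative already sits further than its margin from the positive and contributes nothing, while the entity-sharing negative scores the same as the positive and contributes exactly its margin.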