Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge

Julien Delile, Srayanta Mukherjee, Anton Van Pamel, Leonid Zhukov

TL;DR

The paper tackles information overload in biomedical literature by focusing on long-tail knowledge often missed by embedding-based RAG. It introduces a knowledge-graph–based IR (KG IR) that rebalances retrieved texts by undersampling overrepresented clusters and leveraging a biomedical KG built from entities and relations; KG IR achieves about twice the precision and recall of embedding-only methods, and a hybrid of KG and embedding signals outperforms both. Empirical results across multiple diseases show KG IR better captures diverse, recent, and impactful biomedical information, while the hybrid approach maximizes retrieval quality at practical volumes. This work suggests that integrating graph-structured retrieval with semantic embeddings can significantly improve biomedical QA and long-tail knowledge surfacing, with future directions including chain-of-thought prompting and KG-to-text generation to further enhance generation quality and coverage.

Abstract

Large language models (LLMs) are transforming the way information is retrieved, with vast amounts of knowledge being summarized and presented via natural language conversations. Yet, LLMs are prone to highlight the most frequently seen pieces of information from the training set and to neglect the rare ones. In the field of biomedical research, the latest discoveries are key to academic and industrial actors but are obscured by the abundance of an ever-increasing literature corpus (the information overload problem). Surfacing new associations between biomedical entities (e.g., drugs, genes, diseases) with LLMs becomes a challenge of capturing the long-tail knowledge of biomedical scientific production. To overcome this challenge, Retrieval Augmented Generation (RAG) has been proposed to alleviate some of the shortcomings of LLMs by augmenting the prompts with context retrieved from external datasets. RAG methods typically select the context via maximum similarity search over text embeddings. In this study, we show that RAG methods leave out a significant proportion of relevant information due to clusters of over-represented concepts in the biomedical literature. We introduce a novel information-retrieval method that leverages a knowledge graph to downsample these clusters and mitigate the information overload problem. Its retrieval performance is about twice as good as that of embedding similarity alternatives on both precision and recall. Finally, we demonstrate that embedding similarity and knowledge graph retrieval methods can be advantageously combined into a hybrid model that outperforms both, enabling potential improvements to biomedical question-answering models.
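
To make the over-representation problem concrete, the following is a minimal sketch of maximum similarity search over text embeddings, followed by a simple per-cluster cap. The cap is a hypothetical stand-in used only to illustrate downsampling; it is not the knowledge-graph procedure the paper proposes. The function names, the toy 2-dimensional vectors, and the cluster labels are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_by_similarity(question_vec, chunk_vecs, k):
    """Indices of the k chunks closest to the question in embedding space."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(question_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def top_k_with_cluster_cap(question_vec, chunk_vecs, cluster_ids, k, cap):
    """Same ranking, but keep at most `cap` chunks per cluster.

    Hypothetical rebalancing rule for illustration only; the paper
    downsamples over-represented clusters via a knowledge graph instead.
    """
    ranked = top_k_by_similarity(question_vec, chunk_vecs, len(chunk_vecs))
    kept, per_cluster = [], {}
    for i in ranked:
        if len(kept) >= k:
            break
        cluster = cluster_ids[i]
        if per_cluster.get(cluster, 0) < cap:
            kept.append(i)
            per_cluster[cluster] = per_cluster.get(cluster, 0) + 1
    return kept


# Toy data: three near-duplicate chunks (cluster 0) and one rarer chunk (cluster 1).
question = [1.0, 0.0]
chunks = [[1.0, 0.05], [1.0, 0.1], [1.0, 0.15], [0.8, 0.6]]
clusters = [0, 0, 0, 1]

print(top_k_by_similarity(question, chunks, 2))
print(top_k_with_cluster_cap(question, chunks, clusters, 2, 1))
```

On this toy input, plain top-2 retrieval returns two chunks from the dense cluster, while the capped variant returns one chunk from each cluster, which is the kind of crowding-out effect the abstract attributes to over-represented concepts.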

Paper Structure

This paper contains 15 sections, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Knowledge graphs enable an alternative retrieval-augmentation mechanism. A. Retrieval-augmentation enables question-answering from knowledge unseen during LLM training by inserting additional text chunks into the prompt (Retrieval) and summarising the augmented prompt in response to the user question. B. A commonly used retrieval mechanism involves selecting the text chunks from the corpus which are the most similar to the user question in embedding space (e.g. the top-20 closest neighbours using cosine similarity and OpenAI's text-embedding-ada-002 embedding engine). C. The proposed knowledge graph-based IR approach maps text chunks to the graph nodes and edges and uses entity recognition in the question to retrieve the texts along the shortest path linking the identified entities.
  • Figure 2: Annotations from scientific articles structure the biomedical knowledge graph used for retrieval. A. A scientific article is annotated by extracting entities (NER) and the type of relationships linking entities (RE). B. Each sentence containing at least one entity is mapped onto the knowledge graph, either onto the link between two entities if it contains two entities that are linked semantically ($+$ sign), or onto its individual entities otherwise ($\times$ sign).
  • Figure 3: Retrieval performance comparison between embedding similarity IR (blue), knowledge graph IR (orange) and the hybrid method (green). The metrics compare each method's retrieval performance on the same task: retrieving biomedical text chunks which are relevant to the question "What are the known drug targets for treating <disease>?" over 8 diseases: asthma, pulmonary arterial hypertension, heart failure, hypertension, Parkinson's disease, Alzheimer's disease, liver cirrhosis, inflammatory bowel disease. Solid lines indicate the metric averages and transparent ribbons the 95% confidence intervals. A-B. Recall@K and Precision@K. C. Number of clusters containing at least one retrieved text chunk, among the 200 clusters defined in the 1536-dimensional embedding space.
  • Figure 4: Characterization of differences between the ES IR and KG IR methods over the text embedding landscape. Each plot represents the 731k 1536-dimensional text chunk embeddings in two dimensions via UMAP transformation. A. The biomedical entity landscape illustrates the entity(ies) present in each text chunk. Text chunks containing a pair of genes (resp. drugs) are represented with the same colour as text chunks containing a single gene (resp. drug). B. Disease area landscape. Only text chunks containing at least one disease entity are represented. C. Question similarity landscape; colours indicate the cosine similarity of the text chunk embedding with the question embedding ("What are the known drug targets for treating Asthma?", also in D-E). Arrows indicate remote spots of text chunks that are most similar to the question embedding. D. High-density retrieval regions indicate the parts of the landscape where both methods are retrieving most of their chunks for K=200. Arrow indicates a secondary cluster of high-density retrieval for KG IR. Black dots represent the 355 text chunks that are part of the gold-standard dataset. E. Granular comparison of the retrieved documents for K=200. All dots represent the 442 curated text chunks.
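
The retrieval mechanism described in Figure 1C and Figure 2B, together with the Recall@K and Precision@K metrics of Figure 3, can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the graph is treated as unweighted and undirected, entity recognition is assumed to have already produced two entity names, and every identifier here (`disease_X`, `gene_A`, `drug_B`, the sentence ids, and all function names) is hypothetical. Precision@K is computed with the conventional hits/K definition, which the captions do not spell out.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Breadth-first search for a shortest path in an unweighted graph."""
    if source == target:
        return [source]
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour in parents:
                continue
            parents[neighbour] = node
            if neighbour == target:
                # Walk back from the target to the source via parent links.
                path = [target]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None  # the two entities are not connected


def retrieve_along_path(path, node_texts, edge_texts):
    """Collect the chunks mapped to each node, then each edge, on the path."""
    chunks = []
    for node in path:
        chunks.extend(node_texts.get(node, []))
    for a, b in zip(path, path[1:]):
        chunks.extend(edge_texts.get(frozenset((a, b)), []))
    return chunks


def precision_recall_at_k(retrieved, relevant, k):
    """Precision@K = hits / K and Recall@K = hits / |relevant| (assumed definitions)."""
    hits = sum(1 for chunk in retrieved[:k] if chunk in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy graph: sentences with two linked entities sit on edges,
# sentences with a single entity sit on nodes.
adjacency = {
    "disease_X": ["gene_A"],
    "gene_A": ["disease_X", "drug_B"],
    "drug_B": ["gene_A"],
}
node_texts = {"gene_A": ["s3"]}
edge_texts = {
    frozenset(("disease_X", "gene_A")): ["s1"],
    frozenset(("gene_A", "drug_B")): ["s2"],
}

path = shortest_path(adjacency, "disease_X", "drug_B")
retrieved = retrieve_along_path(path, node_texts, edge_texts)
print(path, retrieved)
print(precision_recall_at_k(retrieved, {"s1", "s2", "s9"}, 2))
```

Keying edge texts by `frozenset` keeps the lookup independent of the direction in which the path traverses an edge, which matches the undirected-graph assumption above; a directed or typed-relation graph would need a different key.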