Retrieve to Explain: Evidence-driven Predictions for Explainable Drug Target Identification

Ravi Patel, Angus Brayne, Rogier Hintzen, Daniel Jaroslawicz, Georgiana Neculae, Dane Corneil

TL;DR

Retrieve to Explain (R2E) presents a retrieval-based framework that scores each candidate drug target by evidence retrieved from a biomedical corpus, representing answers through their evidence and attributing scores to passages via Shapley values for faithful explanations. The model supports updating with new evidence without retraining and enables human-in-the-loop decision making, showing competitive performance against non-explainable baselines and genetics baselines in predicting clinical trial outcomes. It introduces a masked entity-linked corpus, a transformer-based retriever, and a Reasoner that uses a set transformer to produce evidence-grounded scores, with Shapley attributions and a bias-correction mechanism to manage literature bias. The work includes three new benchmarks (held-out literature, gene descriptions, and clinical trial outcomes) and demonstrates that explanation-driven evidence auditing (e.g., with GPT-4) can further improve predictive performance and transparency in high-stakes drug discovery tasks.

Abstract

Language models hold incredible promise for enabling scientific discovery by synthesizing massive research corpora. Many complex scientific research questions have multiple plausible answers, each supported by evidence of varying strength. However, existing language models lack the capability to quantitatively and faithfully compare answer plausibility in terms of supporting evidence. To address this, we introduce Retrieve to Explain (R2E), a retrieval-based model that scores and ranks all possible answers to a research question based on evidence retrieved from a document corpus. The architecture represents each answer only in terms of its supporting evidence, with the answer itself masked. This allows us to extend feature attribution methods, such as Shapley values, to transparently attribute answer scores to supporting evidence at inference time. The architecture also allows incorporation of new evidence without retraining, including non-textual data modalities templated into natural language. We developed R2E for the challenging scientific discovery task of drug target identification, a human-in-the-loop process where failures are extremely costly and explainability paramount. When predicting whether drug targets will subsequently be confirmed as efficacious in clinical trials, R2E not only matches non-explainable literature-based models but also surpasses a genetics-based target identification approach used throughout the pharmaceutical industry.

Paper Structure

This paper contains 70 sections, 16 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: R2E drug target identification example. R2E makes predictions based on retrieved evidence and provides explanations in terms of the evidence. Query: User queries are phrased in cloze-style, where [MASK] can be filled from a set of potential answers (named entities). For target identification, answers are the set of protein-coding genes (potential drug targets), and the query specifies a disease. Retrieval: R2E retrieves the evidence most relevant to the query for each potential answer, where evidence here is taken from across the biomedical literature that mentions the specific answer. Prediction: The model scores each answer based on the supporting evidence. Explanation: Each answer score is directly and quantitatively attributed to its retrieved evidence using Shapley values. Here, the best evidence is indirect, based on the role of CD6 in mechanisms central to rheumatoid arthritis pathology.
  • Figure 2: R2E architecture schematic. Illustration of inference and explanation. Input: A user-defined cloze-style query, a possible answer (named entity) to evaluate, and a corpus of evidence passages corresponding to that answer entity, with entity mentions replaced with [MASK]. Retriever: The query text is encoded with a transformer. All of the entity's evidence passages are encoded prior to inference, using the same encoder, and stored in a FAISS search index. The $k$ evidence passages with highest cosine similarity to the query are retrieved. Reasoner: Each evidence embedding is stacked with the query embedding. The resulting query-evidence pairs are layer-normalised before each pair is combined at corresponding dimensions into a single embedding using convolutional layers. All combined pair embeddings are passed to a set transformer, followed by a linear layer and sigmoid to obtain the binary probability. Shapley values for each pair (corresponding to each piece of evidence) can be computed to quantitatively explain the prediction. Output: To rank a set of answer entities $a_{1...N}$, binary probabilities are obtained independently for each. Shapley values attribute model predictions back to the evidence passages, providing an explanation of the model's prediction.
  • Figure 3: Masked entity-linked corpus for Held-out Biomedical Literature experiments. Here we illustrate how the masked entity-linked corpus was partitioned to enable Reasoner/MLM and Retriever training, validation, and testing. Specifically, the 2020 year-split setup used for the Held-out Biomedical Literature experiments is shown.
  • Figure 4: Relative Success on Clinical Trial Outcomes. Relative success for a given number of positive predictions (x-axis) for each model. The different numbers of positive predictions were achieved by varying the threshold for a positive prediction for each model.
  • Figure 5: R2E performance across disease areas. AUROC in each PharmaProjects annotated disease area with more than 100 therapeutic hypotheses. Predictions by R2E retrieving from literature-alone (R2E-cor (lit)), genetics-alone (R2E-uncor (genetic)), both genetics and literature (R2E-cor (both)), or genetics and literature with LLM auditing (R2E-audit (both)); in comparison to the genetics baseline (Genetic). The number of therapeutic hypotheses for each disease area is given in brackets.
  • ...and 1 more figure
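
The retrieval and explanation steps described in the captions for Figures 1 and 2 can be illustrated with a minimal sketch. It is not the authors' implementation: plain NumPy stands in for the FAISS index, random vectors stand in for transformer embeddings, and `make_toy_scorer` is a hypothetical replacement for the Reasoner (a sigmoid over summed similarities instead of a set transformer). Shapley values are computed exactly by enumerating subsets, which is only feasible for a small number of evidence passages; how the paper computes or approximates them is not stated in this section.

```python
import itertools
import math

import numpy as np


def top_k_cosine(query_vec, passage_vecs, k):
    """Return indices and similarities of the k passages closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    sims = p @ q                      # cosine similarity per passage
    order = np.argsort(-sims)[:k]     # highest similarity first
    return order, sims[order]


def make_toy_scorer(sims, bias=-1.0):
    """Hypothetical stand-in for the Reasoner: maps an evidence subset to (0, 1)."""
    def score(subset):
        total = bias + sum(sims[j] for j in subset)
        return 1.0 / (1.0 + math.exp(-total))
    return score


def exact_shapley(score_fn, n_items):
    """Exact Shapley values for a set function over n_items pieces of evidence.

    score_fn takes a tuple of item indices and returns a scalar.
    Cost grows as O(2^n), so this is only practical for small n.
    """
    values = np.zeros(n_items)
    for i in range(n_items):
        others = [j for j in range(n_items) if j != i]
        for size in range(n_items):
            weight = (math.factorial(size) * math.factorial(n_items - size - 1)
                      / math.factorial(n_items))
            for subset in itertools.combinations(others, size):
                gain = score_fn(subset + (i,)) - score_fn(subset)
                values[i] += weight * gain
    return values


# Toy data: random vectors in place of encoded query and evidence passages.
rng = np.random.default_rng(0)
query = rng.normal(size=8)
passages = rng.normal(size=(20, 8))

idx, sims = top_k_cosine(query, passages, k=4)
score = make_toy_scorer(sims)
phi = exact_shapley(score, n_items=len(idx))

full = score(tuple(range(len(idx))))   # score with all retrieved evidence
empty = score(())                      # score with no evidence

for passage_id, attribution in zip(idx, phi):
    print(f"passage {passage_id}: Shapley attribution {attribution:+.4f}")
print(f"sum of attributions = {phi.sum():.4f}, full - empty = {full - empty:.4f}")
```

By the efficiency property of Shapley values, the attributions sum to the difference between the score with all retrieved evidence and the score with none, which is what makes a per-passage breakdown of an answer score additive.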