Ruling Out to Rule In: Contrastive Hypothesis Retrieval for Medical Question Answering

Byeolhee Kim, Min-Kyung Kim, Young-Hak Kim, Tae-Joon Jeon

Abstract

Retrieval-augmented generation (RAG) grounds large language models in external medical knowledge, yet standard retrievers frequently surface hard negatives that are semantically close to the query but describe clinically distinct conditions. While existing query-expansion methods improve query representation to mitigate ambiguity, they typically focus on enriching target-relevant semantics without an explicit mechanism to selectively suppress specific, clinically plausible hard negatives. This leaves the system prone to retrieving plausible mimics that overshadow the actual diagnosis, particularly when such mimics are dominant within the corpus. We propose Contrastive Hypothesis Retrieval (CHR), a framework inspired by the process of clinical differential diagnosis. CHR generates a target hypothesis $H^+$ for the likely correct answer and a mimic hypothesis $H^-$ for the most plausible incorrect alternative, then scores documents by promoting $H^+$-aligned evidence while penalizing $H^-$-aligned content. Across three medical QA benchmarks and three answer generators, CHR outperforms all five baselines in every configuration, with improvements of up to 10.4 percentage points over the next-best method. On the $n=587$ pooled cases where CHR answers correctly while embedded hypothetical-document query expansion does not, 85.2\% have no shared documents between the top-5 retrieval lists of CHR and of that baseline, consistent with substantive retrieval redirection rather than light re-ranking of the same candidates. By explicitly modeling what to avoid alongside what to find, CHR bridges clinical reasoning with retrieval mechanism design and offers a practical path to reducing hard-negative contamination in medical RAG systems.
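The excerpt above does not give the exact scoring function, but the description ("promoting $H^+$-aligned evidence while penalizing $H^-$-aligned content") together with the contrastive weight $\lambda$ mentioned in Figure 5 suggests a difference of similarities. The sketch below is a minimal illustration under that assumption, using toy embeddings in place of a real encoder; `chr_score`, the variable names, and the embedding values are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def chr_score(doc_emb, h_pos_emb, h_neg_emb, lam=1.0):
    """Assumed contrastive score: reward alignment with the target
    hypothesis H+, penalize alignment with the mimic hypothesis H-,
    weighted by lam (the lambda of Figure 5)."""
    return cosine(doc_emb, h_pos_emb) - lam * cosine(doc_emb, h_neg_emb)

# Toy 3-d embeddings standing in for encoder outputs.
h_pos = [1.0, 0.0, 0.0]  # embedding of the target hypothesis H+
h_neg = [0.0, 1.0, 0.0]  # embedding of the mimic hypothesis H-
docs = {
    "doc_target": [0.9, 0.1, 0.0],  # evidence aligned with H+
    "doc_mimic":  [0.1, 0.9, 0.0],  # hard negative aligned with H-
}
ranked = sorted(docs, key=lambda d: chr_score(docs[d], h_pos, h_neg),
                reverse=True)
```

With $\lambda = 1.0$ the hard negative `doc_mimic` is pushed below `doc_target` even though both are close to the query region, which is the redirection effect the abstract attributes to CHR.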

Paper Structure

This paper contains 23 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of Contrastive Hypothesis Retrieval (CHR). Given a clinical question, CHR generates a contrastive hypothesis pair consisting of a target hypothesis ($H^+$) and a mimic hypothesis ($H^-$), retrieves documents using contrastive scoring that promotes target-aligned content while penalizing mimic-aligned content, and generates the final answer from the retrieved evidence.
  • Figure 2: Prompt template for contrastive hypothesis generation. The system prompt establishes the role of a medical specialist, and the user prompt instructs the model to generate a target hypothesis ($H^+$) for the likely correct diagnosis and a mimic hypothesis ($H^-$) for the most plausible incorrect alternative.
  • Figure 3: Case study showing why the negative (mimic) hypothesis $H^-$ is essential for discriminative retrieval. In the left box, bolded text highlights misleading co-occurrences of antivirals with parkinsonian symptoms (as side effects, not treatments), which led to the incorrect answer. In the right box, bolded text highlights the key sentences identifying amantadine as an antiviral repurposed for Parkinson's disease, directly leading to the correct answer.
  • Figure 4: Additional case study from MedQA (oncology domain). In the left box, bolded text highlights how retrieved documents focus on tamoxifen-related gynecologic bleeding, the dominant complication in the tamoxifen safety literature, which led to the incorrect answer. In the right box, bolded text highlights evidence describing tamoxifen's pro-coagulant effects via hepatic estrogen agonism, directly supporting deep venous thrombosis as the correct answer.
  • Figure 5: Sensitivity of CHR accuracy to the contrastive weight $\lambda$ on MedQA (Gemma-2-9B-It). The dashed red lines indicate the robust plateau region $\lambda \in [0.6, 1.2]$ where CHR consistently outperforms both baselines. Performance peaks at $\lambda = 1.0$.