CliniQ: A Multi-faceted Benchmark for Electronic Health Record Retrieval with Semantic Match Assessment

Zhengyun Zhao, Hongyi Yuan, Jingjing Liu, Haichao Chen, Huaiyuan Ying, Songchi Zhou, Yue Zhong, Sheng Yu

TL;DR

CliniQ introduces a public, large-scale benchmark for EHR retrieval with two realistic settings (Single-Patient and Multi-Patient) and a fine-grained semantic-match taxonomy. The dataset is built on 1,000 MIMIC-III discharge summaries and comprises 1,246 queries and 77,206 chunk-level relevance judgments, enabling detailed analysis of string and semantic matches. Empirical results show that BM25 is a strong baseline and that general-domain dense retrievers frequently outperform medical-domain ones; RRF fusion substantially improves recall by combining lexical and semantic signals. The semantic-match breakdown shows that implication and other semantic match types remain difficult, which points to targeted improvements and future research directions. Overall, CliniQ provides a scalable, multifaceted resource for advancing EHR retrieval systems and their clinical impact.
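The fusion step mentioned above can be illustrated with a minimal sketch of reciprocal rank fusion (RRF). This is not the authors' implementation: the function name, the chunk IDs, and the constant k=60 (a commonly used default) are assumptions made for illustration. Each ranked list contributes 1 / (k + rank) to a chunk's score, so chunks ranked highly by either the lexical or the dense retriever rise in the fused list.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list (hypothetical helper).

    rankings: list of lists, each ordered from most to least relevant.
    k: smoothing constant; 60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up chunk IDs standing in for a lexical and a dense ranking.
bm25_ranking = ["c3", "c1", "c7"]
dense_ranking = ["c1", "c5", "c3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because RRF uses only ranks, it needs no score normalization between BM25 and dense similarity scores, which is one reason it is a common choice for combining lexical and semantic retrievers.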

Abstract

Electronic Health Record (EHR) retrieval plays a pivotal role in various clinical tasks, but its development has been severely impeded by the lack of publicly available benchmarks. In this paper, we introduce a novel public EHR retrieval benchmark, CliniQ, to address this gap. We consider two retrieval settings, Single-Patient Retrieval and Multi-Patient Retrieval, reflecting various real-world scenarios. Single-Patient Retrieval focuses on finding relevant parts within a patient note, while Multi-Patient Retrieval involves retrieving EHRs from multiple patients. We build our benchmark upon 1,000 discharge summary notes along with the ICD codes and prescription labels from MIMIC-III, and collect 1,246 unique queries with 77,206 relevance judgments by further leveraging powerful LLMs as annotators. Additionally, we include a novel assessment of the semantic gap issue in EHR retrieval by categorizing matching types into string match and four types of semantic matches. On our proposed benchmark, we conduct a comprehensive evaluation of various retrieval methods, ranging from conventional exact match to popular dense retrievers. Our experiments find that BM25 sets a strong baseline and is competitive with the dense retrievers, and that general-domain dense retrievers surprisingly outperform those designed for the medical domain. In-depth analyses of the various matching types reveal the strengths and drawbacks of different methods, highlighting the potential for targeted improvement. We believe that our benchmark will stimulate the research community to advance EHR retrieval systems.

Paper Structure

This paper contains 33 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) Dataset Collection Pipeline of CliniQ. (b) Performance of various baseline retrieval methods in the Single-Patient and Multi-Patient Retrieval settings in CliniQ. The score reported for each model is an average of MRR, NDCG, and MAP for Single-Patient retrieval, and an average of MRR, NDCG@10, and recall@100 for Multi-Patient Retrieval. (c) Performance of various baseline retrieval methods regarding different match types in Single-Patient Retrieval. The score reported for each model is an average of MRR, NDCG, and MAP.
  • Figure 2: The prompt for drug name cleaning. drug_name represents the input drug name.
  • Figure 3: The prompt for relevance judgment. note and terms represent the chunk and the query terms to be annotated.
  • Figure 4: Cumulative proportion of query length in word counts.
  • Figure 5: Distributions of different match types decomposed by the query type.
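
The Figure 1 caption reports, for Single-Patient Retrieval, a score that averages MRR, NDCG, and MAP. The sketch below shows one way such a composite could be computed from binary relevance labels. It is an illustration rather than the paper's evaluation code: the function names are hypothetical, relevance is assumed binary, and every relevant chunk is assumed to appear somewhere in the ranked list (as when all chunks of one note are ranked).

```python
import math


def reciprocal_rank(rels):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for i, r in enumerate(rels, start=1):
        if r:
            return 1.0 / i
    return 0.0


def average_precision(rels):
    """Mean of precision@i over the positions i holding a relevant item."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0


def ndcg(rels):
    """DCG of the ranking divided by the DCG of the ideal reordering."""
    def dcg(labels):
        return sum(r / math.log2(i + 1) for i, r in enumerate(labels, start=1))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0


def single_patient_score(rels_per_query):
    """Average each metric over queries, then average the three metrics."""
    n = len(rels_per_query)
    mrr = sum(reciprocal_rank(r) for r in rels_per_query) / n
    mean_ap = sum(average_precision(r) for r in rels_per_query) / n
    mean_ndcg = sum(ndcg(r) for r in rels_per_query) / n
    return (mrr + mean_ndcg + mean_ap) / 3.0


# Made-up labels: 1 marks a relevant chunk at that rank, 0 an irrelevant one.
example = [[0, 1, 0, 1], [1, 1, 0]]
print(single_patient_score(example))
```

The Multi-Patient score in the same caption swaps in NDCG@10 and recall@100, which would require truncating the ranked list at the stated cutoffs and knowing the total number of relevant items per query.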