CliniQ: A Multi-faceted Benchmark for Electronic Health Record Retrieval with Semantic Match Assessment
Zhengyun Zhao, Hongyi Yuan, Jingjing Liu, Haichao Chen, Huaiyuan Ying, Songchi Zhou, Yue Zhong, Sheng Yu
TL;DR
CliniQ is a public, large-scale benchmark for EHR retrieval with two realistic settings (Single-Patient and Multi-Patient) and a fine-grained semantic-match taxonomy. The dataset is built on 1,000 MIMIC-III discharge summaries, 1,246 queries, and 77,206 chunk-level relevance judgments, enabling detailed analysis of string and semantic matches. Empirical results show that BM25 is a strong baseline and that general-domain dense retrievers frequently outperform medical-domain ones; Reciprocal Rank Fusion (RRF) substantially improves recall by combining lexical and semantic signals. The semantic-match breakdown reveals that implication and other semantic match types pose substantial challenges, pointing to targeted improvements and future research directions. Overall, CliniQ provides a scalable, multi-faceted resource for advancing EHR retrieval systems and their clinical impact.
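For readers unfamiliar with rank fusion, the following is a minimal sketch of Reciprocal Rank Fusion under stated assumptions, not the benchmark's own implementation. The function name, the chunk identifiers, and the two example rankings are hypothetical; the constant k = 60 is the value commonly used in the RRF literature, and the setting used in the CliniQ experiments is not given in this summary.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into a single ranking.

    Each id receives 1 / (k + rank) from every list it appears in
    (rank starts at 1), and ids are sorted by their summed score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a lexical and a dense retriever.
bm25_ranking = ["chunk_3", "chunk_1", "chunk_7"]
dense_ranking = ["chunk_1", "chunk_9", "chunk_3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

In this toy input, chunks returned by both retrievers rise above chunks returned by only one, which is the mechanism by which fusion can recover relevant items that either signal alone ranks low.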
Abstract
Electronic Health Record (EHR) retrieval plays a pivotal role in various clinical tasks, but its development has been severely impeded by the lack of publicly available benchmarks. In this paper, we introduce a novel public EHR retrieval benchmark, CliniQ, to address this gap. We consider two retrieval settings that reflect different real-world scenarios: Single-Patient Retrieval, which focuses on finding relevant parts within a single patient note, and Multi-Patient Retrieval, which involves retrieving EHRs across multiple patients. We build our benchmark upon 1,000 discharge summary notes, along with the ICD codes and prescription labels from MIMIC-III, and collect 1,246 unique queries with 77,206 relevance judgments by further leveraging powerful LLMs as annotators. Additionally, we include a novel assessment of the semantic gap issue in EHR retrieval by categorizing matching types into string match and four types of semantic match. On our proposed benchmark, we conduct a comprehensive evaluation of various retrieval methods, ranging from conventional exact match to popular dense retrievers. Our experiments find that BM25 sets a strong baseline and performs competitively with dense retrievers, and that general-domain dense retrievers surprisingly outperform those designed for the medical domain. In-depth analyses of the various matching types reveal the strengths and drawbacks of different methods, highlighting the potential for targeted improvement. We believe that our benchmark will stimulate the research community to advance EHR retrieval systems.
